Abstract

In recent years, researchers have established the viability of so-called hybrid NN/HMM
large vocabulary, speaker independent continuous speech recognition systems, where neural
networks (NN) are used for the estimation of acoustic emission probabilities for hidden
Markov models (HMM) which provide statistical temporal modeling. Work in this direction
is based on a proof that neural networks can be trained to estimate posterior class
probabilities. Advantages of the hybrid approach over traditional mixture of Gaussians
based systems include discriminative training, fewer parameters, contextual inputs and
faster sentence decoding.

However, hybrid systems usually have training times that are orders of magnitude
higher than those observed in traditional systems. This is largely due to the costly,
gradient-based error-backpropagation learning algorithm applied to very large neural
networks, which often requires the use of specialized parallel hardware.

This thesis examines how a hybrid NN/HMM system can benefit from the use of modular
and hierarchical neural networks such as the hierarchical mixtures of experts (HME)
architecture. Based on a powerful statistical framework, it is shown that modularity and
the principle of divide-and-conquer applied to neural network learning reduce training
times significantly. We developed a hybrid speech recognition system based on modular
neural networks and the state-of-the-art continuous density HMM speech recognizer
JANUS. The system is evaluated on the English Spontaneous Scheduling Task (ESST),
a 2400 word spontaneous speech database.

We developed an adaptive tree growing algorithm for the hierarchical mixtures of
experts, which is shown to yield better usage of the parameters of the architecture than
a pre-determined topology. We also explored alternative parameterizations of expert and
gating networks based on Gaussian classifiers, which allow even faster training because
of near-optimal initialization techniques. Finally, we enhanced our originally context
independent hybrid speech recognizer to model polyphonic contexts, adopting decision
tree clustered context classes from a Gaussian mixtures system.

Chapter 1
Introduction 


Speech is the natural form of communication for humans. We use it extensively in
our everyday life without noticing the complexity of this form of communication. Speech
production is a highly nonlinear process that is strongly influenced by factors such as
regional dialects, age, gender and emotional state. Speech perception is even more
complex, since it involves a high degree of variability through additional background noise,
different room acoustics and/or transmission characteristics in the case of telephone lines.
Despite this immense variability, we are able to use this form of communication even in
adverse environments such as noisy parties. In fact, speech is the first and most natural
form of communication that we humans learn at the very beginning of our lives.

In contrast, communicating with a computer requires knowledge about how to use
a mouse and a keyboard and how to interpret textual messages appearing in many
different windows. Most people would prefer to use speech when dealing with machines
and computers. Some applications, such as information systems over telephone lines, even
require this form of communication. There are many other applications where the user’s
hands are busy doing other things and speech is the only reasonable input modality.
Think about computers in cars and airplanes.

Therefore, there has been a large amount of research in automatic speech recognition,
understanding and translation since the early 1950s. Although researchers have
demonstrated impressive results with state-of-the-art hidden Markov model based systems,
today’s speech recognition technology is still far from being competitive with
human skills. Current speech recognition systems perform very well in very specific and
limited domains. Applying such systems to new domains usually leads to unacceptably
low performance.

Automatic speech recognition has to be considered far from being a solved problem
and further improvement may require new insights and the exploration of new paradigms.
The question is, what makes humans so good at perceiving, recognizing and understanding
speech? Unfortunately, we are also far from understanding the cognitive processes
necessary to answer this question. What we do know is that information processing
in the human brain differs completely from the way this is done in traditional computers.
The human brain features billions of small processing elements (neurons) that are
interconnected in complex ways and operate in parallel.

Researchers attempt to simulate this kind of information processing in a very simplified
way in the form of artificial neural networks. Despite their simplicity, these networks have
been applied successfully to static pattern recognition, very often improving performance
over traditional methods. They have also been used for the recognition of speech sounds,
though it is still an open question how to apply them to the temporal modeling necessary
for continuous speech recognition. Since neural networks are very effective models for
the discrimination of speech sounds, researchers started to build hybrid systems that
combine the advantages of neural networks and hidden Markov models by replacing the
usual parametric density modeling by discriminative artificial neural networks. Such
systems have recently begun to be competitive with, and sometimes superior to, traditional
speech recognition systems.

Mostly, neural networks are designed with parallel processing elements in mind, but
implemented on standard serial computers. Also, they are considered to be one big
monolithic entity that is trained and tested as a unit. This renders the learning process
computationally very expensive; it takes orders of magnitude longer than training
traditional density estimators for speech recognition. Recently, modular and hierarchically
organized neural networks have been studied extensively in the neural network and machine
learning community (e.g. Meta-Pi networks [18], Hierarchical Mixtures of Experts
[26], [27]). In these networks, the overall recognition task is divided among several small
sub-networks, so-called experts. The experts’ decisions are integrated in a hierarchical way,
yielding the overall network output. Training times for such mixtures of experts systems
are usually much smaller than those for traditional monolithic neural networks.

In  this  thesis,  we  investigate  modular  neural  networks  for  hybrid  continuous  speech 
recognition  systems,  showing  that  modularity  on  the  network  level  is  a  well  fitting  concept 
for  efficient  and  highly  accurate  neural  network  based  speech  recognition. 

The thesis is organized as follows: Chapter 2 gives a short overview of traditional
neural networks and their statistical interpretation. Chapter 3 reviews basic concepts in
statistical continuous speech recognition and the extension to NN/HMM hybrid speech
recognition. Chapter 4 introduces the hierarchical mixtures of experts architecture and
learning algorithms for this modular neural network. Chapter 5 gives a novel constructive
algorithm for automatically growing a hierarchical network that improves performance
over static hierarchies. Chapter 6 discusses how to model context dependent phones in
hybrid NN/HMM systems and Chapter 7 considers alternative parameterizations for
sub-networks in hierarchical mixtures of experts and discusses their advantages. Finally,
Chapter 8 evaluates a hybrid NN/HMM system based on hierarchical modular neural networks
and the JANUS HMM speech recognizer that was developed as part of this thesis. Chapter 9
presents conclusions and discusses enhancements in future work.


Chapter  2 

Neural  Networks 


This  chapter  will  briefly  review  common  neural  network  architectures  as  far  as  they  are 
important  for  the  remainder  of  this  thesis.  It  finishes  with  an  important  section  on  the 
relationships  between  neural  networks  and  statistical  models. 


2.1  Introduction 

Artificial neural networks are a wide class of flexible nonlinear regression and
classification models. They consist of a (sometimes large) number of processing nodes, called
neurons, which are simple linear or nonlinear computing elements. These elements are
interconnected in a variety of ways and often organized in layers. Fig. 2.1 shows a basic
processing node or neuron.

Figure 2.1: Processing element (neuron) in neural networks

It consists of an activation function $z = \mathrm{net}(x_1, \ldots, x_N) : \mathbb{R}^N \to \mathbb{R}$
and a (possibly) non-linear output function $f(z) : \mathbb{R} \to \mathbb{R}$. The most
commonly used activation functions are

$$\mathrm{net}(x_1, \ldots, x_N) = \sum_{i=1}^{N} w_i x_i$$

$$\mathrm{net}(x_1, \ldots, x_N) = \sum_{i=1}^{N} (x_i - w_i)^2$$

Choices for the output function $f$ are the identity, the sigmoid or the softmax function

$$f(z) = z$$

$$f(z) = \frac{1}{1 + \exp(-z)}$$

$$f(z_i) = \frac{\exp(z_i)}{\sum_{k=1}^{n} \exp(z_k)}$$

for a layer of $n$ neurons. Associated with each neuron is a weight vector
$w = (w_1, \ldots, w_N)$. Sometimes, an additional bias weight $w_0$ with a fixed input value of 1
is used in order to extend the model from linear to affine transformations. Learning
algorithms for neural networks estimate these weights (mostly) iteratively, in order to
minimize a given error function of the outputs.
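
The processing element described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`net_dot`, `net_dist`, `neuron`) and the way the bias weight is added are choices made here, not notation taken from the thesis.

```python
import math

def net_dot(x, w):
    """Inner-product activation: sum_i w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))

def net_dist(x, w):
    """Squared-distance activation: sum_i (x_i - w_i)^2."""
    return sum((xi - wi) ** 2 for wi, xi in zip(w, x))

def identity(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Softmax over the activations of one layer of n neurons."""
    m = max(zs)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def neuron(x, w, w0=0.0, net=net_dot, f=sigmoid):
    """Single neuron: output function applied to the activation plus bias w0."""
    return f(net(x, w) + w0)
```

For instance, `neuron([1.0, 2.0], [0.5, -0.25])` evaluates the sigmoid at an activation of zero, and `softmax` is applied to the list of activations of a whole layer rather than to a single neuron.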

The simplest neural network architecture is a perceptron, which may consist of just
one neuron. It can be trained to discriminate between linearly separable classes using the
sigmoid or softmax non-linearity as output function. However, for more complex
discrimination or approximation tasks, networks with multiple layers of neurons are necessary.
The next two sections describe the most commonly used neural network architectures for
complex tasks and their learning algorithms.


2.2  Multi  Layer  Perceptrons 

A multi layer perceptron (MLP) consists of several layers of neurons with full
interconnections between neurons in adjacent layers (additional interconnections between
non-adjacent layers are called shortcut connections). Fig. 2.2 depicts the structure of such an
architecture. Input data is presented to the network at the input layer, which contains
no processing nodes. It serves only as a data source for the following hidden layer(s).
Finally, the network’s output is computed by neurons in the output layer. The activation
function of all neurons is the inner product between input and weight vectors. Only the
activation of nodes in the input and output layers is directly observable. The nodes in
hidden layers compute internal representations of the data.

MLP’s are useful for supervised pattern recognition where the task is to learn a
mapping between inputs $x$ and outputs $t$ given a set of training examples

$$T = \{(x_1, t_1), \ldots, (x_N, t_N)\}$$

Figure 2.2: Multi Layer Perceptron (MLP)


In  the  training  phase,  the  weights  of  an  MLP  are  usually  updated  by  an  iterative 
learning  algorithm  called  error  backpropagation.  After  this  procedure  converges,  the  MLP 
can  be  used  to  map  new  (unseen)  patterns. 

The error-backpropagation learning algorithm is based on the chain rule for derivatives
of continuous functions. The algorithm consists of a forward pass, in which training
examples $x$ are presented to the network and activations of output neurons $y$ are computed.
This is followed by a backpropagation step which updates the weights of neurons using
the gradient of an error function such as the mean squared error or the cross entropy
between network outputs $y$ and given target outputs $t$.

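For example, using the mean squared error $E = \frac{1}{2} \sum_t \sum_i (y_i - t_i)^2$,
where the outer sum runs over the training examples (the example index is suppressed on
all other quantities), and the sigmoid output function $y_i = f(z_i) = 1/(1 + \exp(-z_i))$
with $z_i = \sum_j w_{ij} h_j$, where $h_j$ are the activations of the hidden layer, the
gradient with respect to the weights of neurons in the output layer $w_{ij}$ is

$$\frac{\partial E}{\partial w_{ij}} = \sum_t (y_i - t_i)\, y_i (1 - y_i)\, h_j$$

Thus, weights in the output layer can be updated as follows in order to minimize the
error function, with $\eta$ denoting the learning rate:

$$w_{ij} \leftarrow w_{ij} - \eta \sum_t (y_i - t_i)\, y_i (1 - y_i)\, h_j$$

The derivative of the error function with respect to weights in the hidden layer $w_{jk}$
can be computed using the chain rule which yields, for sigmoid hidden neurons
$h_j = f(\sum_k w_{jk} x_k)$,

$$\frac{\partial E}{\partial w_{jk}} = \sum_t \Big[ \sum_i (y_i - t_i)\, y_i (1 - y_i)\, w_{ij} \Big]\, h_j (1 - h_j)\, x_k$$

This leads to the following update rule for weights in the hidden layer:

$$w_{jk} \leftarrow w_{jk} - \eta \sum_t \Big[ \sum_i (y_i - t_i)\, y_i (1 - y_i)\, w_{ij} \Big]\, h_j (1 - h_j)\, x_k$$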

It is easy to generalize the backpropagation algorithm to networks with more than one
hidden layer of neurons. It should be noted that there are many extensions of the basic
algorithm, such as an additional momentum term, which aim at improving convergence speed
and final performance. Nevertheless, the backpropagation algorithm is computationally
very expensive, especially for large MLP’s.
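
The forward and backward passes described in this section can be sketched as follows for a single training example. The sketch assumes one hidden layer, sigmoid hidden and output neurons, no bias weights and the squared error; the function names and the learning rate `eta` are illustrative choices, not taken from the thesis.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hid, W_out):
    """Forward pass; returns hidden activations h and outputs y."""
    h = [sigmoid(sum(w * xk for w, xk in zip(row, x))) for row in W_hid]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in W_out]
    return h, y

def error(x, t, W_hid, W_out):
    """Squared error E = 0.5 * sum_i (y_i - t_i)^2 for one example."""
    _, y = forward(x, W_hid, W_out)
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))

def gradients(x, t, W_hid, W_out):
    """Backward pass: dE/dW_out and dE/dW_hid via the chain rule."""
    h, y = forward(x, W_hid, W_out)
    # Output deltas: (y_i - t_i) * f'(z_i), where f' = y (1 - y) for the sigmoid.
    d_out = [(yi - ti) * yi * (1.0 - yi) for yi, ti in zip(y, t)]
    g_out = [[di * hj for hj in h] for di in d_out]
    # Hidden deltas: output deltas propagated back through W_out.
    d_hid = [hj * (1.0 - hj) * sum(d_out[i] * W_out[i][j] for i in range(len(d_out)))
             for j, hj in enumerate(h)]
    g_hid = [[dj * xk for xk in x] for dj in d_hid]
    return g_out, g_hid

def sgd_step(x, t, W_hid, W_out, eta=0.5):
    """One in-place gradient-descent update of all weights."""
    g_out, g_hid = gradients(x, t, W_hid, W_out)
    for i, row in enumerate(W_out):
        for j in range(len(row)):
            row[j] -= eta * g_out[i][j]
    for j, row in enumerate(W_hid):
        for k in range(len(row)):
            row[k] -= eta * g_hid[j][k]
```

Summing the per-example gradients over the training set gives the batch gradient; calling `sgd_step` once per example corresponds to an online variant of the update.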

It  can  be  shown  that  MLP’s  with  at  least  one  hidden  layer  can  approximate  any 
continuous  function  to  any  desired  degree  of  accuracy,  if  there  are  enough  hidden  neurons 
available  (this  property  is  called  universal  function  approximation).  Thus,  MLP’s  with  one 
hidden  layer  are  sufficient,  although  additional  hidden  layers  may  improve  performance 
over  single  hidden  layer  networks  with  an  equal  number  of  neurons  through  increased 
model  complexity. 




2.3  Radial  Basis  Function  networks 

In Radial Basis Function networks (RBF), the hidden layer neurons compute radial basis
functions of the inputs, similar to kernel functions in kernel regression. RBF networks
consist of an input layer, one hidden layer and an output layer. The activation function of
hidden neurons computes the Euclidean or Mahalanobis distance $d$ between input and weight
vectors. Usually, the output function of hidden layer neurons is

$$h_i = \exp\left(-\frac{d^2}{2}\right)$$

The output layer neuron’s activation function is the same as the one used for MLP’s,
the inner product of input and weight vector (with an additional bias input). The RBF
network is mostly used for regression with a linear output layer although it is also possible
to use it for classification with a sigmoid or softmax output layer.

Figure 2.3: Radial Basis Function (RBF) network

Fig. 2.3 shows the structure of an RBF network. RBF hidden neurons are often called
localized receptive fields because of the special form of their activation function. Sometimes
the outputs of the hidden layer neurons are normalized to sum to one as in kernel
regression.

Training  of  RBF  networks  proceeds  in  two  steps: 

1. RBF estimation for hidden neurons. Input feature vectors are clustered according
to the desired number of hidden neurons using a procedure such as k-means,
LBG or neural gas. This results in a set of RBF centers. If the model assumes
a bandwidth, variance vector or covariance matrix for the hidden neurons, these
parameters may be estimated using the data within each cluster.

2. Linear Least Squares for output weight matrix. Once the parameters of the
hidden neurons are computed, they remain fixed and the estimation of the weights
of the (linear) output neurons reduces to a linear least squares problem which can
be solved by standard matrix inversion algorithms.

RBF  networks  can  be  trained  much  faster  than  MLP’s,  but  it  was  shown  that  kernel 
methods  such  as  RBF  networks  tend  to  require  larger  sample  sizes  to  achieve  the  same 
performance,  especially  in  high  dimensional  feature  spaces. 
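
The two training steps can be illustrated for scalar inputs as follows. This is a minimal sketch under several assumptions that are not from the thesis: a plain k-means with quantile initialization, a single fixed bandwidth `width` shared by all hidden neurons, and a small ridge term added to the normal equations for numerical stability.

```python
import math

def kmeans_1d(xs, k, iters=20):
    """Step 1: cluster scalar inputs; returns k RBF centers."""
    xs_sorted = sorted(xs)
    # Initialize the centers at evenly spaced quantiles of the data.
    centers = [xs_sorted[(2 * i + 1) * len(xs) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            buckets[nearest].append(x)
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
    return centers

def hidden(x, centers, width):
    """Gaussian receptive fields plus a constant bias input."""
    return [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers] + [1.0]

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def fit_output_weights(xs, ts, centers, width, ridge=1e-8):
    """Step 2: linear least squares via the normal equations (H^T H) w = H^T t."""
    H = [hidden(x, centers, width) for x in xs]
    n = len(H[0])
    A = [[sum(h[i] * h[j] for h in H) + (ridge if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    b = [sum(h[i] * t for h, t in zip(H, ts)) for i in range(n)]
    return solve(A, b)

def predict(x, centers, width, w):
    return sum(wi * hi for wi, hi in zip(w, hidden(x, centers, width)))
```

Because the hidden layer is fixed after step 1, step 2 involves no iterative gradient descent, which is the reason for the short training times mentioned above.
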
2.4  Statistical  Interpretation

Neural networks and statistical models are not competing methodologies for data analysis.
There is considerable overlap between the two fields. Statistical methodology is directly
applicable to most neural network models, resulting in more efficient parameter estimation
and optimization (learning) algorithms. Additionally, statistical methods provide
diagnostic tools such as confidence intervals and hypothesis testing which are missing in
the field of neural networks.

Recently, statisticians published works which established ties between statistics and
neural networks, sometimes showing the equivalence of statistical and neural network
models. Sarle [48] shows relationships between many neural networks and statistical
models and translates the jargon of the two fields. Ripley [47] provides a very interesting
overview of the similarities of neural networks and statistical models.


2.4.1  Perceptrons 

A perceptron with a linear output function computes a linear combination of the input
features. It is nothing but a linear regression model that can be fit most efficiently
by linear least squares.

In case the output function is nonlinear, a perceptron is a generalized linear model
(GLIM) with the exception that for a perceptron, the nonlinearity is mostly chosen ad
hoc, while the nonlinearity of a GLIM is fixed once a probabilistic model of the outputs
given the inputs is chosen. GLIM’s are fitted by maximum likelihood methods for a
variety of distributions of the exponential family. For multiway classification, one usually
assumes a multinomial (Poisson) density model, which forces the use of the softmax
nonlinearity as output function for the GLIM/perceptron. It is considerably more effective
to use maximum likelihood fitting than mean square error minimization to estimate the
parameters of a perceptron. This fact is important for modular neural networks with
simple perceptron-like processing elements, such as the architecture that we will introduce
later in this thesis.
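
As an illustration of maximum likelihood fitting for a softmax perceptron, the following sketch performs plain gradient steps on the negative log-likelihood (cross entropy) of the class labels. Plain gradient descent is used here only for brevity; it is an assumption of this sketch, as are the function names, the bias handling and the step size `eta`. GLIM fitting procedures in statistics typically use more efficient second-order methods.

```python
import math

def softmax(zs):
    m = max(zs)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, x):
    """Class posteriors of a softmax perceptron (bias stored as last weight)."""
    xb = list(x) + [1.0]
    return softmax([sum(w * v for w, v in zip(row, xb)) for row in W])

def neg_log_likelihood(W, data):
    """Negative log-likelihood of the labels; data holds (x, class index) pairs."""
    return -sum(math.log(predict(W, x)[c]) for x, c in data)

def ml_step(W, data, eta=0.05):
    """One gradient step on the negative log-likelihood."""
    grad = [[0.0] * len(row) for row in W]
    for x, c in data:
        y = predict(W, x)
        xb = list(x) + [1.0]
        for i in range(len(W)):
            err = y[i] - (1.0 if i == c else 0.0)  # y_i - t_i
            for k, v in enumerate(xb):
                grad[i][k] += err * v
    return [[w - eta * g for w, g in zip(row, grow)] for row, grow in zip(W, grad)]
```

Note that the gradient of the log-likelihood with respect to the activations reduces to the simple difference between outputs and targets, without the additional derivative factor of the output function that appears in the squared error case.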

2.4.2  Multi  Layer  Perceptrons 

Like a perceptron, an MLP has counterparts in statistics as well, depending on the number
of hidden layers and the number of neurons in the hidden layers. Sarle [48] categorizes
MLP’s into the following three groups:

•  Small number of hidden neurons. The MLP can be considered a parametric
model such as polynomial regression.

•  Moderate number of hidden neurons. The MLP can be considered a quasi-parametric
model similar to projection pursuit regression.

•  Large number of hidden neurons, possibly increasing with the sample size.
The MLP can be considered a nonparametric sieve.

It is this smooth transition between parametric and nonparametric models that renders
MLP’s especially useful. The error-backpropagation learning algorithm for MLP’s
is iterative, slow and requires the careful adaptation of various learning parameters such
as the learning rate and the momentum factor by trial and error. Since MLP’s perform
multivariate multiple nonlinear regression, their parameters may be estimated much more
efficiently using nonlinear optimization algorithms such as those used for projection pursuit
models.

2.4.3  Unsupervised  Learning 

Unsupervised learning for neural networks consists in extracting useful features from the
input data and eliminating redundancy, without having any target or output vectors
associated with each input vector. From a statistical point of view, things are different.
The goal in most forms of unsupervised learning is to estimate feature variables from
which the observed data can be predicted. In this formulation, the observed data is
considered to be both input and target of the learning process.

Unsupervised Hebbian learning for a one layer linear network, for example, is identical
to principal component analysis, which provides the optimal transformation matrix. This
fact is well-known from statistical theory and many variations of Hebbian learning consist
of inefficient approximations of principal component analysis.
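
The connection between Hebbian learning and principal component analysis can be illustrated with Oja's rule, a normalized Hebbian update for a single linear neuron. Oja's rule is not named in the text above; it is used here as one well-known instance, and the learning rate, number of epochs and seed are arbitrary choices of this sketch.

```python
import math
import random

def oja_first_component(data, eta=0.01, epochs=50, seed=0):
    """Hebbian learning with Oja's normalization for one linear neuron."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(wi * wi for wi in w))
    w = [wi / norm for wi in w]
    for _ in range(epochs):
        for x in data:
            y = sum(wi * xi for wi, xi in zip(w, x))  # linear neuron output
            # Hebbian term y * x plus a decay term -y^2 * w that keeps |w| near 1.
            w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w
```

For approximately zero-mean data, the weight vector returned by this iterative procedure tends toward the direction of the first principal component, which an eigenvalue decomposition of the covariance matrix yields directly. This is the sense in which such learning rules approximate principal component analysis.
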
Chapter  3

Hybrid  Speech  Recognition 


This chapter will first review the basic concepts of today’s state-of-the-art speech
recognition technology based on hidden Markov models. It will then discuss the advantages,
drawbacks and shortcomings of this approach which motivate hybrid speech recognition
systems. The term hybrid speech recognition system is now widely used for systems that
try to bring together the best of two worlds: statistical time alignment by hidden Markov
models and discriminative observation probability estimation by neural networks instead
of by means of parametric multimodal distributions. We will briefly discuss two such
systems, one based on the multi layer perceptron (MLP) and one based on recurrent
neural networks (RNN), as they are currently being investigated by researchers in the
speech community. Finally, we will discuss problems observed with large monolithic neural
networks as used in practical implementations of hybrid speech recognition systems,
motivating the exploration of modular and hierarchical neural networks for hybrid speech
recognition.


3.1  Speech  Recognition 

This section gives a quick overview of current hidden Markov model (HMM) based speech
recognition technology as it is used in almost all current state-of-the-art speech recognition
systems. Readers already familiar with these concepts may want to skip to the next
section.


3.1.1  Overview 

Fig. 3.1 shows the basic setup of a speech recognition system, revealing all its major
components.

Input to the system is a sampled waveform of the audio signal as recorded by a microphone.
Note that the room characteristics, the kind of microphone and the A/D transducers
that are used to record the audio signal can have a severe effect on the speech
representation and recognition. Recently, large efforts have been put into developing so-called
robust systems, which tolerate different kinds of microphones, room characteristics and
noise conditions in the preprocessing stage. This stage is sometimes called feature
extraction or front-end. It computes a sequence of features, mostly derived from spectral or
cepstral representations of speech, which are more suitable for the following stages than
the raw speech waveform. The acoustic modeling stage models a set of speech sounds by
hidden Markov models and (mostly) continuous parametric distributions. For any given
observation at any time step, the acoustic modeling stage provides local probabilities for
each of the modeled atomic sound units. These local scores are then used in a dynamic
programming search (decoder) stage to determine the most likely sequence of words, given
the acoustics. Additional information about prior probabilities of sequences of words is
supplied to the decoder by the language model. We will now go into some detail
concerning the basic blocks of a speech recognizer, but we cannot provide an exhaustive
overview of this field. See [52], [32], [51] for additional information.

3.1.2  Preprocessing 

Speech signals have been observed to have stationary properties over periods no longer
than about 20 ms. Therefore, most speech recognition front-ends use a sliding window
of between 5 ms and 20 ms to extract a vector of features from the speech waveform. Such
vectors are called frames and are typically extracted at a rate of about 100 Hz. The
ideal preprocessing stage should generate a representation of the speech signal that (1)
compresses the speech signal as far as possible without losing any information necessary
for the subsequent recognition and (2) facilitates discrimination between different speech
sounds. Fig. 3.2 shows the sequence of operations usually applied to the speech waveform
in order to compute spectral or cepstral features.

Figure 3.2: Preprocessing for Speech Recognition (block diagram from the speech waveform
through the window function to the feature vector)

The speech signal is multiplied with a window function, then a discrete Fourier transform
(DFT) is applied and the power spectrum is computed. The cepstrum is computed by applying
the logarithm and an inverse discrete Fourier transform (IDFT) to the spectrum. Often,
additional steps such as the following are applied:

•  CMN (cepstral mean normalization). The idea behind this technique is that
the observed audio signal is a linear superimposition of speech and noise, which is
preserved in the cepstral domain. By subtracting the cepstral mean over a whole
utterance, the additive stationary parts of the cepstrum are removed.

•  LDA (linear discriminant analysis). This technique has proven very useful to
reduce the dimensionality of feature vectors. It applies a linear transformation that
minimizes intra-class distance while maximizing inter-class distance. Dimensionality
reduction is achieved by dropping coefficients in the resulting feature vectors
according to their significance. Often, multiple frames are concatenated prior to the
application of LDA to include contextual information in the resulting features.

•  VTLN (vocal tract length normalization). Different speakers have different
vocal tract lengths. Different vocal tract lengths imply different pitch and formant
frequencies for different speakers. This is usually compensated by a linear or
piecewise-linear warping of the frequency axis in the spectrum based on statistics
of formant frequencies.

•  PLP (perceptual linear prediction). Performs several psychophysically based
spectral transformations. It is based on the all-pole filter model used in Linear
Predictive Analysis (LPA).
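
The basic cepstral processing chain and the CMN step from the list above can be sketched as follows. The sketch rests on assumptions that are not from the thesis: a Hamming window as window function, a naive quadratic-time DFT instead of an FFT, a small floor on the power spectrum before taking the logarithm, and illustrative function names.

```python
import cmath
import math

def hamming(n_samples):
    """Hamming window (one common choice of window function)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def dft(xs, inverse=False):
    """Naive O(N^2) discrete Fourier transform; sufficient for a sketch."""
    n = len(xs)
    sign = 1 if inverse else -1
    out = [sum(x * cmath.exp(sign * 2j * math.pi * k * m / n) for m, x in enumerate(xs))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def cepstrum(frame, floor=1e-10):
    """Window -> DFT -> power spectrum -> logarithm -> IDFT."""
    window = hamming(len(frame))
    spectrum = dft([s * w for s, w in zip(frame, window)])
    log_power = [math.log(max(abs(v) ** 2, floor)) for v in spectrum]
    return [c.real for c in dft(log_power, inverse=True)]

def frames(signal, size, shift):
    """Sliding window over the sampled waveform."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, shift)]

def cmn(cepstra):
    """Cepstral mean normalization over all frames of one utterance."""
    n = len(cepstra)
    mean = [sum(c[k] for c in cepstra) / n for k in range(len(cepstra[0]))]
    return [[ck - mk for ck, mk in zip(c, mean)] for c in cepstra]
```

With a sampling rate of 8 kHz (an assumed value), `frames(signal, 160, 80)` corresponds to 20 ms windows extracted at a frame rate of 100 Hz, and `cmn([cepstrum(f) for f in frames(signal, 160, 80)])` yields mean-normalized cepstral feature vectors for one utterance.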

3.1.3  Hidden  Markov  Models 

HMMs model a sequence of observations (in our case a sequence of feature vectors) as
a piecewise stationary process. A discrete HMM is a stochastic finite state automaton
$\lambda = (S, A, B, \pi)$ with a set of stationary states $S$, a transition probability matrix $A$,
an emission probability matrix $B$ and a set of initial state probabilities $\pi$. Usually, speech
recognition systems use strictly left-to-right HMMs to model words, syllables, phonemes
or sub-phonetic units. Often, words are modeled as a sequence of phonemes, which in
turn are modeled as a sequence of HMM states. Fig. 3.3 shows the topology of a typical
phoneme HMM.


Figure  3.3:  Hidden  Markov  Model  topology  for  phonemes 

Different states in a phonetic HMM model different stationary acoustic sounds at the
beginning, middle and end of a phoneme. Viewing the HMM as a generative model, the
term ’hidden’ becomes clearer. HMMs consist of two concurrent stochastic processes.
One is the unobservable sequence of states that models the temporal structure of speech,
the other is the observable sequence of emitted output symbols in each state, modeling
the locally stationary character of speech sounds. Three problems arise
when using HMMs to model sequences of observations:

Evaluation: What is the probability that a given HMM generated a given sequence of
observations?

Decoding: Given a sequence of observations and an HMM, what is the most likely
sequence of states through the HMM that led to the generation of the observations?

Parameter estimation: Given an HMM and a set of observation sequences to be modeled
by this HMM, how can we adapt the parameters (emission and transition probability
distributions) of the model to maximize the likelihood of generation?

All three of the above problems have very efficient solutions in the form of special cases
of dynamic programming algorithms. For instance, the evaluation problem occurs in
isolated word recognition where we want to score different word HMMs according to their
likelihood. It can be solved using the Forward algorithm. The decoding problem occurs
in continuous speech recognition where we are seeking the most probable path through a
very large HMM consisting of all possible sequences of basic sound units. Once we have
found this path, we can derive the most probable sequence of phonemes or words. The decoding
problem can be solved using the Viterbi algorithm. Fig. 3.4 shows a typical trellis diagram
with the optimal path as a Viterbi algorithm would produce it. The diagram also shows
all possible state transitions at one specific time point.


Figure  3.4:  State  trellis  and  the  Viterbi  algorithm 


The last problem, also called the training problem, can be solved by the
Forward-Backward or Baum-Welch algorithm, which is essentially a version of the
Expectation-Maximization (EM) [10] algorithm. In the case of left-right HMMs with a constant small
number of transitions in each state, all three algorithms have a computational complexity
of only O(NT), where N is the number of states in the HMM and T is the number of
observations.
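
The evaluation and decoding problems can be illustrated with a minimal sketch for a discrete-observation HMM. The function names, the dense transition matrix and the toy parameters are assumptions made for illustration; they are not taken from JANUS or any other recognizer. Because the sketch loops over all state pairs, it costs O(N^2 T); the O(NT) bound quoted above requires exploiting the sparse transitions of a left-right model.

```python
def forward(init, trans, emit, obs):
    """Forward algorithm: probability that the HMM generated obs."""
    n = len(init)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)


def viterbi(init, trans, emit, obs):
    """Viterbi algorithm: (probability, path) of the most likely state sequence."""
    n = len(init)
    delta = [init[s] * emit[s][obs[0]] for s in range(n)]
    backptr = []
    for o in obs[1:]:
        prev, delta, step = delta, [], []
        for s in range(n):
            # Best predecessor of state s at this time step.
            best = max(range(n), key=lambda r: prev[r] * trans[r][s])
            step.append(best)
            delta.append(prev[best] * trans[best][s] * emit[s][o])
        backptr.append(step)
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for step in reversed(backptr):
        path.append(step[path[-1]])
    path.reverse()
    return delta[last], path


# Hypothetical two-state left-right model with two discrete output symbols.
INIT = [1.0, 0.0]
TRANS = [[0.5, 0.5], [0.0, 1.0]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]
```

For the observation sequence `[0, 0, 1]`, `viterbi` returns the path `[0, 0, 1]`, and its probability is bounded above by the total probability computed by `forward`. A practical implementation would work with log probabilities to avoid numerical underflow on long utterances.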


3.1.4  Acoustic  Modeling 

Today's state-of-the-art speech recognition systems use parametric multimodal probability
densities to model continuous observations instead of discrete observations as required
in the standard HMM. It was shown empirically that such systems yield better performance
than systems based on vector-quantization derived discrete observation symbols.
Continuous densities are mostly modeled by mixtures of Gaussians, since it was shown
that these mixture models can approximate any kind of distribution, given enough data
to estimate their parameters reliably. In a continuous density HMM, the probability of
observation vector x in a state $s_i$ is modeled by


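$$p(x \mid s_i) = \sum_j c_{ij}\, N_{ij}(x) \quad \text{with} \quad N_{ij}(x) = \frac{1}{\sqrt{(2\pi)^n\, |\Sigma_{ij}|}} \exp\left(-\tfrac{1}{2}\,(x-\mu_{ij})^T\, \Sigma_{ij}^{-1}\, (x-\mu_{ij})\right)$$

where $n$ is the dimension of the observation vector. The Forward-Backward algorithm can
be extended to continuous density observations, which yields update formulas for the
parameters $c_{ij}$ (mixture weights), $\mu_{ij}$ (means) and $\Sigma_{ij}$ (covariance matrices).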
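
As an illustration of how such a mixture density is evaluated for one state, the following sketch computes the log emission probability of a single observation vector. It assumes diagonal covariance matrices (so each Gaussian is described by a variance vector) and uses the log-sum-exp trick for numerical stability; both choices, as well as all function and parameter names, are assumptions of this sketch.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, weights, means, variances):
    """Log of sum_j weight_j * N_j(x) for one HMM state."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    # log-sum-exp: factor out the largest term before exponentiating.
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

A call such as `gmm_log_likelihood([0.0], [1.0], [[0.0]], [[1.0]])` evaluates a single standard normal component at zero. In a semi-continuous system, the `means` and `variances` would be shared across states and only `weights` would differ.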

If there is not enough training data to estimate a separate mixture of Gaussians for
each state of large HMMs, one can share parameters among different states, so that they
use the same set of Gaussians but with different mixture weights. This form of parameter
tying is known as semi-continuous density modeling (SCHMM). A special case of this
kind of modeling is phonetically tied semi-continuous density
modeling (PTSCHMM), where all the states of a phonetic HMM share the same set of
Gaussian densities. Other forms of parameter sharing include state clustering and/or
decision tree clustering.

Another issue is the modeling of context-dependency on the HMM level. It was shown
(see for example [32]) that the explicit modeling of phonemes in different contexts by
different HMMs yields a vast improvement over context-independent systems. Current
systems model biphone, triphone or even polyphone contexts to account for the variability
of speech sounds in different contexts. Since the number of monophones used in
a typical system is around 50, n-phone contexts would require the modeling of $50^n$
different acoustic models. This clearly is not feasible in practice, especially since many
contexts occur rarely or even never in a given training corpus. The solution to this
problem is the use of decision trees with a set of phonetic context questions to cluster
the polyphonic contexts into a reasonably small set of context classes, which are then
modeled by separate HMMs. See [44] for an introduction to decision trees.

3.1.5  Decoding/Search 

The decoder is the essential recognition part of a speech recognizer. It uses locally
computed emission probabilities to find the most likely sequence of words in a dynamic
programming fashion. Typical large-vocabulary continuous-speech recognition tasks today
involve a vocabulary of 20k to 50k words. Additionally, context-dependent modeling
yields over 10k context-dependent phoneme models. Clearly, the standard Viterbi
algorithm for finding the most likely sequence of HMM states is not applicable without
modifications, because of the combinatorial explosion of the size of the search space.
Therefore, most decoders are organized in a multi-pass strategy, applying more detailed
models in successive passes with restricted search spaces. Most decoders are based on
either time-synchronous Viterbi beam search or stack decoding, which is essentially an
A* search.

Viterbi beam search is a modified Viterbi algorithm in which active states are pruned
at each time step, based either on their cumulative score or on their ranking in a list
sorted by cumulative score. This way, only a very limited number of state, phone and
word transitions (50-200) are considered at each time step. A disadvantage of the Viterbi
beam search is its time-synchronous left-to-right mode of operation, which may lead to
recognition errors because many hypotheses are pruned away based on just the
beginning of the utterance, although the remaining part might support keeping
them.
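
The two pruning criteria mentioned above (score-based and rank-based) can be sketched as a single step applied to the set of active states after each frame. The function name, the dictionary representation and the use of log scores are assumptions of this sketch, not a description of a particular decoder.

```python
def prune_beam(scores, beam_width=None, max_active=None):
    """Prune active states after one time step.

    scores: dict mapping state -> cumulative log score (higher is better).
    beam_width: keep states within this distance of the best score.
    max_active: keep at most this many of the best-ranked states.
    """
    if not scores:
        return {}
    kept = scores
    if beam_width is not None:
        best = max(scores.values())
        kept = {s: v for s, v in kept.items() if v >= best - beam_width}
    if max_active is not None:
        ranked = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
        kept = dict(ranked[:max_active])
    return kept
```

With scores `{"a": -1.0, "b": -3.0, "c": -10.0}`, a beam width of 5.0 keeps `a` and `b`, while `max_active=1` keeps only `a`. A hypothesis that is pruned here cannot be recovered later, which is exactly the source of the search errors described above.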

A  stack  decoder  is  a  non  time-synchronous  search  algorithm,  comparing  incomplete 
paths  of  different  lengths  by  means  of  a  likelihood  function  that  estimates  the  probability 
of  the  most  likely  remaining  paths.  The  basic  data  structure  used  in  this  kind  of  search  is 
a  stack  which  contains  a  sorted  list  of  active  incomplete  paths  together  with  their  score. 
At each iteration of the search, the top entry is examined and all possible extensions of
the associated incomplete path are evaluated and inserted into the stack. The accuracy of
this  algorithm  clearly  depends  on  the  size  of  the  stack.  Often,  a  stack  decoder  is  used  as 
a  second  search  pass,  following  a  Viterbi  beam  search  that  restricts  the  search  space  and 
provides  estimates  of  probabilities  of  partial  paths. 

Other  important  search  techniques,  especially  in  the  case  of  large  vocabularies,  include 
the  organization  of  the  pronunciation  lexicon  in  form  of  a  phonetic  prefix  tree.  Since  many 
words  start  with  the  same  sequence  of  phonemes,  the  storage  requirements  can  be  reduced 
significantly  using  this  approach. 
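
A minimal sketch of such a phonetic prefix tree is given below. The nested-dictionary node layout, the function names and the toy lexicon are hypothetical and chosen only to show how shared prefixes collapse into shared nodes.

```python
def build_prefix_tree(lexicon):
    """Build a phonetic prefix tree from {word: [phoneme, ...]}."""
    root = {"children": {}, "words": []}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            # Reuse the child for phoneme p if an earlier word created it.
            node = node["children"].setdefault(p, {"children": {}, "words": []})
        node["words"].append(word)
    return root


def count_nodes(node):
    """Number of nodes in the tree, including the root."""
    return 1 + sum(count_nodes(c) for c in node["children"].values())


# Hypothetical three-word lexicon sharing the prefix K AE.
LEXICON = {"can": ["K", "AE", "N"], "cat": ["K", "AE", "T"], "cab": ["K", "AE", "B"]}
```

For this toy lexicon, a linear organization needs nine phoneme entries, whereas the tree needs only five phoneme nodes below the root. Word identities are only known at the nodes where a pronunciation ends, which is why language model scores cannot be applied at the start of a word in a tree-organized lexicon.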

Usually the output of the decoder is not only the single best-scoring hypothesis for a
given utterance, but a list of the n-best hypotheses or a word graph (word lattice),
which can be subject to further processing.


3.1.6  Language  Modeling 


The  task  of  automatic  speech  recognition  is  to  find  the  most  probable  word  sequence  w 
given  a  sequence  of  acoustic  observations  x,  which  is  the  maximum  posterior  sentence 
probability.  According  to  Bayes  rule,  it  can  be  decomposed  into 


$$\max_i\, p(w_i \mid x) = \max_i \frac{p(x \mid w_i)\, P(w_i)}{p(x)}$$


The denominator can be neglected since it is constant for all $w_i$, and $p(x \mid w_i)$ is
computed by the acoustic model. It remains to provide a means for estimating the prior sentence
probabilities $P(w_i)$. These probabilities are computed by the language model and can be
used in the decoder and/or in subsequent rescoring passes based on n-best lists or word
graphs.

We will not go into much detail here and only describe the very basics of language
modeling, that is, statistical n-gram modeling. The basic assumption here is that the
probability of a word in a sentence depends only on the previous n - 1 words. The prior
probability of a given sentence of m words can then be factored as follows:

$$P(w) = \prod_{k=1}^{m} p(w_k \mid w_{k-1}, \ldots, w_1) = \prod_{k=1}^{m} p(w_k \mid w_{k-1}, \ldots, w_{k-n+1})$$

In the case of a bigram model, we have to estimate the probabilities $p(w_k \mid w_{k-1})$;
in the case of a trigram model, we have to estimate the probabilities
$p(w_k \mid w_{k-1}, w_{k-2})$. This can be done
by scanning large text corpora and counting occurrences of word pairs or word triples,
respectively. Since many trigrams that may be encountered in a test sentence do not
occur in even the largest text corpora, we have to use a smoothing technique which
avoids word probabilities of zero. The standard procedure here is to use a weighted sum
of unigram, bigram and trigram probabilities where the weights are determined by an
algorithm called deleted interpolation. Despite the simplicity of this approach, it has
proven to work very well for large vocabulary continuous speech recognition.
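
The weighted sum of unigram, bigram and trigram estimates can be sketched as follows. Note that the interpolation weights are fixed by hand here, which is a simplifying assumption: deleted interpolation would estimate them on held-out data. The function names, the sentence padding symbols and the toy corpus are likewise hypothetical.

```python
from collections import Counter


def train_counts(sentences):
    """Collect n-gram and history counts from tokenized sentences."""
    c = {"uni": Counter(), "bi": Counter(), "tri": Counter(),
         "h1": Counter(), "h2": Counter(), "total": 0}
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for k in range(2, len(padded)):
            u, v, w = padded[k - 2], padded[k - 1], padded[k]
            c["uni"][w] += 1
            c["bi"][(v, w)] += 1
            c["tri"][(u, v, w)] += 1
            c["h1"][v] += 1        # how often v occurred as a bigram history
            c["h2"][(u, v)] += 1   # how often (u, v) occurred as a trigram history
            c["total"] += 1
    return c


def interpolated_prob(c, u, v, w, lambdas=(0.2, 0.3, 0.5)):
    """p(w | u, v) as a weighted sum of relative-frequency estimates."""
    l1, l2, l3 = lambdas  # assumed fixed weights, not deleted interpolation
    p1 = c["uni"][w] / c["total"] if c["total"] else 0.0
    p2 = c["bi"][(v, w)] / c["h1"][v] if c["h1"][v] else 0.0
    p3 = c["tri"][(u, v, w)] / c["h2"][(u, v)] if c["h2"][(u, v)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3


CORPUS = [["we", "meet", "monday"], ["we", "meet", "tuesday"]]
```

A trigram that never occurred in `CORPUS` still receives a non-zero probability as long as its last word was seen, because the unigram term contributes. Words that never occur in the training text would still get probability zero in this sketch and need separate treatment.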


3.2  Discussion 

This section discusses advantages and drawbacks of traditional HMM based speech
recognition systems, as they have been described in the previous sections.

Advantages:


• Rich mathematical framework HMMs are based on a flexible statistical theory
which allows even large systems to be built consistently.

• Efficient learning and decoding algorithms These algorithms handle sequences
of observations probabilistically and they do not require an explicit hand
segmentation in terms of the basic speech units. They can be implemented very efficiently
even for very large systems.

• Easy integration of multiple knowledge sources Different levels of constraints
(e.g. phonological and syntactical) can be incorporated within the HMM framework
as long as they are expressed in terms of the same statistical formalism.


Disadvantages:

• Poor discrimination Estimation of the parameters of HMMs is based on
likelihood maximization. This means that only correct models receive training information;
incorrect models do not get any feedback.

• 1st order Markov assumption Current observations and state transitions
depend only on the previous state; all other history is neglected.

• Independence assumptions Consecutive feature vectors are assumed to be
statistically independent.

• Require distributional assumptions For example, modeling acoustic
observations by mixtures of Gaussians with diagonal covariance matrices requires
uncorrelated feature coefficients, which is not the case.

• Assumption that speech is a piecewise stationary process All
representational power goes into the modeling of stationary parts of speech, although it is
known that speech should rather be modeled as a sequence of transitions or
trajectories in the feature space. This is somewhat alleviated by incorporating delta and
delta-delta features into the process of feature generation.

• Assumption of exponential state duration distributions This assumption is
an integral part of 1st order HMMs. It can only be circumvented by applying explicit
state duration modeling, that is, imposing external duration distributions such as a
gamma distribution.

• Maximum likelihood based This is a disadvantage because maximum likelihood
estimation always relies on the correctness of the models, which simply does not hold
in the case of speech recognition.

• Complexity All of the above disadvantages require additional modifications and
enhancements of the basic HMM technology that lead to complex heuristics-based
systems.


3.3  Hybrid  Speech  Recognition 

Hybrid speech recognition systems try to attack some of the disadvantages of traditional
HMMs while still adhering to the general statistical formalism. In particular, since these
methods use neural networks as emission probability estimators, training is based on
posterior class probabilities instead of maximum likelihood. Neural network classifiers are
discriminative in nature and do not impose constraints such as uncorrelated feature
coefficients, although they are not free of distributional assumptions, as shown in the previous
chapter.

3.3.1 Neural Networks as Statistical Estimators

It was shown that neural networks such as MLPs can be trained to compute estimates
of the posterior class probabilities $p(\omega_i \mid x)$, given an input vector x. HMMs require the
computation of likelihoods $p(x \mid q_i)$ for hypothesized states $q_i$. Fortunately, we can apply
Bayes' rule to convert posteriors into scaled likelihoods that can then be used as observation
probabilities:


$$\frac{p(x \mid q_i)}{p(x)} = \frac{P(q_i \mid x)}{P(q_i)}$$

In the above equation, $P(q_i)$ is the prior probability of state $q_i$ and the neural network
must be trained to produce estimates of the posterior state probabilities $P(q_i \mid x)$. This means
that we need to train a neural network which has as many output nodes as there are HMM
states. We can compute scaled likelihoods by dividing the network outputs by the prior
state probabilities.
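
The division of network outputs by state priors is usually carried out in the log domain. The following sketch shows this for a single frame; the function name and the flooring constant that guards against taking the logarithm of zero are assumptions of this sketch.

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert posterior state probabilities of one frame into scaled
    log likelihoods: log P(q_i | x) - log P(q_i)."""
    return [math.log(max(p, floor)) - math.log(max(q, floor))
            for p, q in zip(posteriors, priors)]
```

For example, posteriors `[0.7, 0.2, 0.1]` and priors `[0.5, 0.25, 0.25]` yield the logarithms of 1.4, 0.8 and 0.4. The common factor p(x) is the same for all states in a frame, so omitting it does not change the outcome of a Viterbi search.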

It should be noted that in theory, HMMs could also be trained using local posterior
probabilities as emission probabilities. In [2], an iterative procedure based on the EM
algorithm is used to compute local estimates of posterior class probabilities which can
be used as 'soft' targets for neural networks. This approach aims at optimizing the
global posterior probability for the sequence of word models, instead of maximizing the
likelihood.

To keep the number of states low enough to train a large neural network in a
reasonable amount of time, most researchers first experimented with context-independent
HMM systems with one-state phonemic HMMs. In this case, the number of HMM states
equals the number of phonemes and the neural network estimates posterior phoneme
probabilities. The extension of this technique to context-dependent modeling is possible
by factoring context-dependent posteriors and using multiple neural networks to estimate
context-dependent observation probabilities. This will be described in detail in
chapter 6.

3.3.2  Training  Issues 

In order to train a neural network such that the resulting outputs estimate posterior class
probabilities, we need to generate target vectors for each frame. When training the
network on 1-out-of-N targets, an explicit segmentation in the form of class labels for each frame
is necessary. Usually these labels are generated by an existing HMM speech recognizer
for the given task. For any given training utterance, there is a sentence transcription
available. This transcription is used to build a sentence HMM by concatenating
the HMMs of the corresponding word models, which in turn are built by concatenating
subword-unit HMMs according to the word pronunciation dictionary. Once an
HMM for the complete utterance is built, we can do a forced Viterbi alignment
using the existing recognizer, which gives us the most probable sequence of states through
the HMM, given the sequence of acoustic observations. Thus, we have generated state
labels for each frame of the utterance. Once a neural network is sufficiently trained on
these targets, using the performance on an independent cross validation set as a
measure of generalization, new targets can be computed by recomputing the forced Viterbi
alignment using the neural network to compute emission probabilities. This procedure
may continue in an iterative manner. Alternatively, the Forward-Backward algorithm may be
used instead of the Viterbi alignment, which will result in soft targets.
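
Turning a forced alignment into 1-out-of-N training targets is a simple bookkeeping step, sketched below. The representation of the alignment as a list of state indices (one per frame) is an assumption, and estimating state priors by relative frequency from the same alignment is a common choice that the text above does not prescribe.

```python
def one_hot_targets(alignment, num_states):
    """One 1-out-of-N target vector per frame from a state alignment."""
    targets = []
    for state in alignment:
        row = [0.0] * num_states
        row[state] = 1.0
        targets.append(row)
    return targets


def state_priors(alignment, num_states):
    """Relative frequency of each state in the alignment (assumed estimator)."""
    counts = [0] * num_states
    for state in alignment:
        counts[state] += 1
    return [n / len(alignment) for n in counts]
```

For the hypothetical alignment `[0, 0, 1, 2, 2, 2]` with three states, the third frame receives the target `[0.0, 1.0, 0.0]`. With Forward-Backward alignment, each row would instead hold the state occupation probabilities of that frame.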

3.4  Examples  of  Hybrid  Systems 

This section briefly describes two current hybrid systems that have been successfully
used for continuous speech recognition. One is based on large multi layer perceptrons
(MLP), the other uses recurrent neural networks (RNN).

3.4.1  A  MLP  based  Hybrid 

Researchers at the International Computer Science Institute (ICSI) in Berkeley have
developed a hybrid speech recognition system that uses large multi layer perceptrons (MLP)
to estimate posterior class probabilities. Fig. 3.5 shows an example of such a network.




Figure  3.5:  ICSI’s  multi  layer  perceptron  topology 

The network is trained by stochastic gradient error backpropagation using the Ring
Array Processor (RAP), a parallel computer needed to keep training times in a reasonable
range (days and not weeks). To reduce training times even further, the network was
initialized by training on a hand-labeled phonetic database (TIMIT) before training it on
the larger target task.


3.4.2  A  RNN  based  Hybrid 

The group at Cambridge University Engineering Department (CUED) has developed a
hybrid connectionist/HMM speech recognition system called ABBOT [21], which uses
recurrent neural networks to compute emission probabilities. The network is depicted in
Fig. 3.6. It uses a set of state units that have recurrent connections from their outputs
back to their inputs (these units also have connections to the input nodes). State units
and input nodes are connected to the output layer.


Figure 3.6: Cambridge recurrent neural network (figure labels: input feature vectors, recurrent layer, output layer)


The network is trained using backpropagation through time. This training method is
computationally very expensive; researchers in Cambridge report training times of several
days on a dedicated parallel computer. Also, due to potential instabilities inherent in
recurrent systems, training seems to require careful adjustment of learning parameters.
The system has fewer parameters than a competitive mixture-of-Gaussians system, which
yields a faster decoding stage. Recently, the system was augmented to incorporate small
neural networks to model context classes. This context-dependent system achieved the
lowest reported error rate on the 1995 SQUALE continuous speech recognition evaluation
3.6.


3.5 Problems


All of today's existing hybrid speech recognition systems require special parallel hardware
to be able to train the neural networks in a reasonable amount of time. Also, they
require the choice of many parameters such as the learning rate, momentum factor or
batch size. Although it was shown that large monolithic neural networks can do an
excellent job in the computation of emission probabilities, they are mostly considered
as 'black boxes'. Because of the lack of understanding of how the networks perform the
classification task, network weights are usually initialized with small random numbers,
which requires many iterations of backpropagation for the weights to converge. Mixtures
of Gaussians based recognizers benefit from powerful initialization methods like the k-means
algorithm. Parameters for such systems usually converge within only 2-5 iterations of
Forward-Backward training.

The major drawback of hybrid systems, however, is the inefficiency of gradient-based
training algorithms. Sizes of speech databases and neural networks in hybrid recognizers
have gradually increased and will increase even further over the next years. Training
times for such networks could become prohibitive, even with fast hardware.


Chapter 4

Hierarchical  Mixtures  of  Experts 


This  chapter  introduces  Hierarchical  Mixtures  of  Experts  as  a  modular  and  hierarchical 
neural  network  for  supervised  learning.  It  closely  follows  the  presentation  by  Jordan 
and Jacobs [27], but focuses on classification instead of regression. The underlying
statistical model will be discussed in detail, in order to motivate the presentation of an
effective learning method for the architecture: the EM algorithm.


4.1  Introduction 

The Hierarchical Mixture of Experts for the purpose of classification is a direct competitor
to other, non-modular and non-hierarchical neural network classifiers such as
Multi Layer Perceptrons or Radial Basis Function Networks, which have proven to be very
powerful and general classifiers and function approximators. Therefore, the reader may
ask questions like: Why do we need a modular, hierarchical network if we already have
powerful methods for classification and regression? What are the drawbacks of traditional
neural networks and other monolithic classifiers that led to the development of modular
and hierarchical architectures?

Fig. 4.1 shows a particular situation where a modular approach to, in this case,
function approximation yields significantly better results than traditional methods. The
function to be approximated is piecewise linear with a discontinuity at x = 0. Clearly,
the best way to approximate this kind of function is to split the task into two subregions
and apply standard linear regression to the data in each of the regions. This leads to
the least possible number of parameters and the best approximation possible. The figure
also shows a typical approximation obtained by an MLP or a higher order polynomial
interpolation scheme. These methods usually produce smooth approximation surfaces that are not
able to capture discontinuities like the one in our example. Even worse, the discontinuity
leads to oscillations in the overall approximation surface that can only be reduced by
using a larger number of parameters, which in turn leads to an unnecessarily increased
model complexity.

Figure 4.1 (plot title: 1-2-1 MLP learns piecewise linear function)


Another major drawback of traditional neural networks is the complexity of their
training algorithms, mostly based on gradient descent methods. This kind of training
algorithm is slow and tedious, requiring the user to set various algorithmic parameters by
trial and error. Training of large MLPs on very large databases (which is the case in hybrid
speech recognition) requires such a large amount of CPU cycles that even when using
parallel implementations of backpropagation on dedicated hardware, researchers report
training times of several days. This renders the analysis and optimization of learning
parameters very time consuming, if at all possible.

It should be noted that recent work in statistics has shown similarities between neural
networks and statistical models such as generalized linear models, maximum redundancy
analysis, projection pursuit and cluster analysis, that allow the application of much more
efficient statistical learning/estimation techniques to the training of MLPs. In fact, it
was shown that an MLP with one hidden layer is essentially the same as the projection
pursuit model, except that an MLP uses a predetermined functional form for the activation
function in the hidden layer. Parameters of such a model can be estimated more efficiently
by general purpose nonlinear modeling or optimization programs.

The remainder of this chapter will introduce a modular, hierarchical architecture for
supervised learning that tackles all the discussed problems of standard neural networks.

4.1.1 Architecture

The Hierarchical Mixture of Experts architecture consists of relatively simple (i.e. one
layer) gating and expert networks, organized in a tree structure as shown in Fig. 4.2.
The basic principle behind this structure is the well known technique called
divide-and-conquer. Algorithms of this kind solve complex problems by dividing them into simpler
problems for which solutions can be obtained very easily. These partial solutions are then
integrated to yield an overall solution to the whole problem. In the Hierarchical Mixtures
of Experts architecture, the leaves of the tree represent expert networks, which act as
simple local problem solvers. Their output is hierarchically combined by so-called gating
networks at the internal nodes of the tree to form the overall solution. To be more specific,
the architecture has to learn an input-output mapping $y = f(x)$ based on a set of training
samples $T = \{(x_i, y_i),\ i = 0, \ldots, N\}$. The expert networks as well as the gating networks
receive the input vectors x, with the difference that the gating networks use the input to
compute confidence values for the outputs of their children, whereas the expert networks
use the input to generate an estimate of the desired output value.


There are similar tree-structured divide-and-conquer models in statistics,
namely CART by Breiman et al. [6], MARS by Friedman [15] and ID3 by Quinlan
[44]. However, these algorithms solve function approximation or classification problems
by explicitly dividing the input space into subregions, such that only one single 'expert'
is contributing to the overall output of the model. Because of these 'hard splits' of the
input space, CART, MARS and ID3 tend to be variance-increasing, especially in the case
of high-dimensional input spaces, where data is very sparsely distributed. In contrast,
the gating networks in an HME are capable of computing soft splits of the input space,
allowing input data to lie simultaneously in multiple regions. In this case, many experts
contribute to the overall output, which has a variance-decreasing effect.

All the expert networks in the HME tree realize linear mappings between the input
and the output space with an additional output non-linearity. One can also interpret
the experts as single layer perceptrons. In the case of multiway classification, the
non-linearity is generally chosen to be the softmax function, whereas in the case of regression
the non-linearity is the identity and the experts are strictly linear. The selection of
the non-linearity depends on the probabilistic interpretation of the model and will be
explained in the following section.

Consider the two-layer, binary branching HME in Fig. 4.2. Each of the expert networks
$(i,j)$ produces its output from the input x according to:

$$\mu_{ij} = f(U_{ij}\, x)$$

where $U_{ij}$ is a weight matrix and $f$ is the output non-linearity. The input vector x
is considered to have an additional constant coordinate value of 1.0 to allow for network
biases.

The gating networks are also generalized linear. Since they perform multiway
classification among the experts, the non-linearity is chosen to be the softmax non-linearity.
The output values $g_i$ of the top-level gating network are computed according to:

$$g_i = \frac{\exp(\xi_i)}{\sum_k \exp(\xi_k)} \quad \text{with} \quad \xi_i = v_i^T x$$


Due to the special form of the softmax non-linearity, the $g_i$ are positive and sum up to
one for each input vector x. They can be interpreted as the local conditional probability
that an input vector x lies in the region of the associated child node. The lower level
gating networks compute their output activations similarly to the top-level gating network:

$$g_{j|i} = \frac{\exp(\xi_{ij})}{\sum_k \exp(\xi_{ik})} \quad \text{with} \quad \xi_{ij} = v_{ij}^T x$$


The output activations of the expert networks are weighted by the gating networks'
output activations as they proceed up the tree to form the overall output vector.
Specifically, the output at the i-th internal node in the second layer of the tree is

$$\mu_i = \sum_j g_{j|i}\, \mu_{ij}$$

and the output at the top level (root node) is

$$\mu = \sum_i g_i\, \mu_i$$

Since  both  the  expert  and  the  gating  networks  compute  their  activations  as  a  function 
of  the  input  x,  the  overall  output  of  the  architecture  is  a  nonlinear  function  of  the  input 
(even  in  the  case  of  linear  experts).  Furthermore,  different  input  spaces  may  be  used 
for  gating  and  expert  networks.  In  the  case  of  speech  recognition,  the  gating  networks 
might  be  supplied  with  additional  input  features,  e.g.  speaking  rate,  in  order  to  facilitate 
discrimination  between  different  sounds. 
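
The forward computation of a two-level HME for classification can be sketched as follows: softmax experts at the leaves, softmax gates at the root and at the internal nodes, and gate-weighted sums on the way up the tree. All function and parameter names are assumptions of this sketch, and for simplicity the gates and experts receive the same input vector.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of activations."""
    top = max(z)
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [v / total for v in e]


def matvec(W, x):
    """Product of a weight matrix (list of rows) with an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def hme_forward(x, V_top, V_low, U):
    """Overall output of a two-level HME for classification.

    x:        input vector, last coordinate fixed to 1.0 for the bias
    V_top:    root gating weights, one row per branch i
    V_low[i]: gating weights of internal node i, one row per child j
    U[i][j]:  weights of expert (i, j), one row per output class
    """
    g = softmax(matvec(V_top, x))               # top-level gate outputs
    mu = None
    for i, g_i in enumerate(g):
        g_low = softmax(matvec(V_low[i], x))    # lower-level gate outputs
        for j, g_ji in enumerate(g_low):
            mu_ij = softmax(matvec(U[i][j], x))  # expert output
            if mu is None:
                mu = [0.0] * len(mu_ij)
            for c, m in enumerate(mu_ij):
                mu[c] += g_i * g_ji * m
    return mu
```

Because the gate outputs and the expert outputs each sum to one, the overall output is again a probability distribution over the classes. With all weights set to zero, every gate and every expert is uniform, so the output is uniform as well.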


4.1.2  Probabilistic  Interpretation 

The architecture is best understood as a generative probabilistic decision tree. Observable
data is assumed to be generated by the model in the following way: For each input vector
x, the output values computed by the gating networks are interpreted as the multinomial
probabilities of selecting one of the children nodes. Starting at the root node, a particular
sequence of decisions is made based on the probability distributions imposed by the gating
networks. This process eventually ends in a terminal node of the tree containing a specific
expert network. This expert network computes a linear activation $\mu_{ij}$ using its weight
matrix. The vector $\mu_{ij}$ is considered to be the mean of a probability density that models
the generation of output vectors.

The gating networks' parameterization corresponds to a multinomial logit probability
model, which is a special case of a Generalized Linear Model (GLIM) [34]. That is, gating
network outputs are assumed to follow a multinomial density

$$P(y_1, \ldots, y_n) = \frac{m!}{\prod_i y_i!} \prod_i p_i^{y_i}$$

where the $p_i$ are the multinomial probabilities associated with the different classes
(in this case the children nodes) and $m = \sum_i y_i$ is generally taken to equal one for
classification problems.

The probability density for the expert networks is assumed to be a member of the
exponential family of densities. In the case of regression, the probabilistic component is
generally chosen to be a Gaussian density with mean $\mu_{ij}$, whereas in the case of
multiway classification, the expert's probability density function is the same as for the
gating networks, with the difference that the gating networks discriminate between
children nodes and the expert networks discriminate between output classes.

Given these assumptions, the total probability of generating the output $y$ from the
input $x$ can be given in the form of a hierarchical mixture model:

$$P(y \mid x, \theta) = \sum_i g_i(x, v_i) \sum_j g_{j|i}(x, v_{ij}) \, P(y \mid x, \theta_{ij})$$

In this notation, $\theta$ contains both the gating networks' parameters $v_i$, $v_{ij}$ and the
experts' parameters $\theta_{ij}$.


4.1.3  Posterior  Probabilities 


In order to develop learning algorithms for the hierarchy, we need to introduce posterior
node and posterior branch probabilities. Consider the training of a given HME architecture,
where we explicitly know the desired output vector $y$ for each input vector $x$. In
this context, we consider the gating probabilities $g_i$ and $g_{j|i}$ to be prior branch
probabilities, since they are computed based on the input vector $x$ alone, without any
knowledge about the target output vector $y$. Using both the input and output vectors,
posterior branch probabilities can be defined for the gating networks:


$$h_i = \frac{g_i \sum_j g_{j|i} P_{ij}(y)}{\sum_i g_i \sum_j g_{j|i} P_{ij}(y)} \qquad h_{j|i} = \frac{g_{j|i} P_{ij}(y)}{\sum_j g_{j|i} P_{ij}(y)}$$


Based on these conditional posterior probabilities, we can compute unconditional node
probabilities for each node in the tree by multiplying all the conditional posterior branch
probabilities along the path from the root node to the node in question. This way, we
can assign a posterior probability to each of the expert networks too:


$$h_{ij} = h_i h_{j|i} = \frac{g_i g_{j|i} P_{ij}(y)}{\sum_i g_i \sum_j g_{j|i} P_{ij}(y)}$$

$h_{ij}$ is interpreted as the probability that expert network $(i, j)$ has generated the
observed data pair $(x, y)$. Note that posterior probabilities are not available during testing,
where we do not have any knowledge about the target output vector $y$. They are needed
exclusively for the derivation of learning algorithms.
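
As a worked illustration of these definitions, the following sketch computes the posterior branch and node probabilities for a single training pair. It assumes that the priors $g_i$, $g_{j|i}$ and the expert likelihoods $P_{ij}(y)$ have already been evaluated and that every top-level branch has at least one child with nonzero likelihood; the function name `posteriors` and the nested-list layout are hypothetical.

```python
def posteriors(g, g_cond, p):
    """Posterior probabilities for one data pair (x, y) in a two-level HME.

    g[i]         prior top-level branch probability g_i
    g_cond[i][j] prior second-level branch probability g_{j|i}
    p[i][j]      likelihood P_ij(y) of the target under expert (i, j)
    Returns (h_i, h_{j|i}, h_ij) as nested lists.
    """
    n = len(g)
    # branch[i][j] = g_{j|i} * P_ij(y)
    branch = [[g_cond[i][j] * p[i][j] for j in range(len(p[i]))] for i in range(n)]
    inner = [sum(row) for row in branch]                 # sum_j g_{j|i} P_ij(y)
    total = sum(g[i] * inner[i] for i in range(n))       # common denominator
    h_top = [g[i] * inner[i] / total for i in range(n)]
    h_cond = [[b / inner[i] for b in branch[i]] for i in range(n)]
    h_joint = [[h_top[i] * h_cond[i][j] for j in range(len(p[i]))] for i in range(n)]
    return h_top, h_cond, h_joint
```

The joint posteriors $h_{ij}$ sum to one over all experts, which is a convenient sanity check when implementing the E-step.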


4.2  Gradient  Ascent  Learning 

Since we assume that the HME realizes a probabilistic generative model of our data, we
can define the likelihood of our model given a training set $T = \{(x^{(t)}, y^{(t)}),\ t = 0, \ldots, N\}$
and treat the learning problem as a maximum likelihood problem. This kind of learning
algorithm for HMEs was introduced by Jordan and Jacobs [26]. The derivation of this
learning algorithm is fairly straightforward, and it can be realized both as an on-line and
as a batch learning method.

4.2.1  The  Likelihood 

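It is common to use the log of the likelihood instead of the likelihood itself, which converts
the product of probabilities to a sum:

$$l(\theta; \mathcal{X}) = \sum_t \log P(y^{(t)} \mid x^{(t)}, \theta) = \sum_t \log \sum_i g_i^{(t)} \sum_j g_{j|i}^{(t)} P_{ij}(y^{(t)})$$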

In order to derive an update algorithm for the gating network and expert network
parameters, we need the derivatives of the log likelihood with respect to the gating and
expert parameters, respectively. For the top-level gating network, we obtain

$$\frac{\partial l(\theta; \mathcal{X})}{\partial v_i} = \sum_t \left( h_i^{(t)} - g_i^{(t)} \right) x^{(t)}$$

where we have used the derivative of the softmax function with respect to the linear
gating activations $\xi_j = v_j^T x$

$$\frac{\partial g_i}{\partial \xi_j} = g_i \left( \delta_{ij} - g_j \right)$$

Similarly, it can be shown that the derivative of the likelihood with respect to the
second layer gating networks is

$$\frac{\partial l(\theta; \mathcal{X})}{\partial v_{ij}} = \sum_t h_i^{(t)} \left( h_{j|i}^{(t)} - g_{j|i}^{(t)} \right) x^{(t)}$$

Since we are interested in the set of parameters that maximize the log likelihood of
the observed data given the model, we perform gradient ascent in weight space using
the likelihood gradients and a learning factor $\eta$ to update the parameters of the gating
networks:


$$v_i \leftarrow v_i + \eta \sum_t \left( h_i^{(t)} - g_i^{(t)} \right) x^{(t)}$$

$$v_{ij} \leftarrow v_{ij} + \eta \sum_t h_i^{(t)} \left( h_{j|i}^{(t)} - g_{j|i}^{(t)} \right) x^{(t)}$$

The above learning rule suggests an update after the presentation of the complete
training set. Instead of computing the exact gradients of the log likelihood over the whole
training set, we could also use a variant, called stochastic gradient update, which updates
the parameters each time a fixed number $m$ of training samples has been presented to
the architecture. This form of parameter update is usually called on-line learning and
leads to faster convergence.
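
The difference between the batch rule and the on-line variant can be made concrete with a small sketch for the top-level gate. It is a simplified illustration only: the posteriors $h_i$ are passed in as fixed numbers, whereas in actual training they would be recomputed from the current parameters, and the names `gate_step` and `train_epoch` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_step(weights, samples, eta):
    """One gradient-ascent step v_i <- v_i + eta * sum_t (h_i - g_i) x.

    weights: list of weight vectors v_i, one per branch
    samples: list of (x, h) pairs, h holding the posterior branch probabilities
    """
    grads = [[0.0] * len(v) for v in weights]
    for x, h in samples:
        g = softmax([sum(vd * xd for vd, xd in zip(v, x)) for v in weights])
        for i in range(len(weights)):
            for d, xd in enumerate(x):
                grads[i][d] += (h[i] - g[i]) * xd
    return [[vd + eta * gd for vd, gd in zip(v, gv)]
            for v, gv in zip(weights, grads)]

def train_epoch(weights, samples, eta, m):
    """Apply an update after every m samples; m = len(samples) is the batch rule."""
    for start in range(0, len(samples), m):
        weights = gate_step(weights, samples[start:start + m], eta)
    return weights
```

Calling `train_epoch` with `m` equal to the size of the training set reproduces the batch update, while `m = 1` updates after every sample, so later samples already see the effect of earlier updates within the same pass.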

It remains to derive parameter update rules for the expert networks. Depending on
the chosen probability density model for the expert networks, we obtain different update
rules. Therefore we have to distinguish between regression and classification tasks and
derive the different update algorithms in the next two sections.


4.2.2  Expert  Parameter  Updates  for  Regression 

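When the HME is used for function approximation, the underlying probability density
is assumed to be Gaussian. To simplify the derivation of the update rule, we assume a
unit variance Gaussian density, although update rules for Gaussians with full covariance
matrices exist too. The gradient of the log likelihood with respect to the parameters
$\theta_{kl}$ of the $(k, l)$-th expert is

$$\frac{\partial l(\theta; \mathcal{X})}{\partial \theta_{kl}} = \sum_t \frac{g_k^{(t)} g_{l|k}^{(t)} \left( \partial P(y^{(t)} \mid x^{(t)}, \theta_{kl}) / \partial \theta_{kl} \right)}{\sum_i g_i^{(t)} \sum_j g_{j|i}^{(t)} P_{ij}(y^{(t)})} = \sum_t h_{kl}^{(t)} \left( y^{(t)} - \mu_{kl}^{(t)} \right) x^{(t)T}$$

which leads to the gradient update rule for expert parameters

$$\theta_{kl} \leftarrow \theta_{kl} + \eta \sum_t h_{kl}^{(t)} \left( y^{(t)} - \mu_{kl}^{(t)} \right) x^{(t)T}$$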

Note that the above learning rule updates the whole weight matrix at once. If the
hierarchy is capable of learning a given approximation problem perfectly, the differences
between the target vectors and the HME's linear predictions will eventually
converge to zero. The gradient of the log likelihood will vanish and the updates will
become zero.


4.2.3 Expert Parameter Updates for Classification

The objective of this thesis is to apply modular neural networks in a hybrid speech
recognition environment. Therefore, we are mainly interested in the use of HMEs as classifiers
and posterior class probability estimators. In the case of classification, the same kind of
probability density applies to the expert and the gating networks, since they both perform
multiway classification. However, for training a classifier, we usually have a data
set with 'hard' targets. That means there is a class label associated with each input
vector $x$. Using a 1-out-of-$N$ encoding of class labels, the multinomial probability density
degenerates as follows

$$P(t_1, \ldots, t_N) = \prod_i p_i^{t_i} = p_c$$

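Here, the $p_i$ are the output values of the classifier and the $t_i$ are the target values for
each class (which are zero for all but one class). $p_c$ stands for the output value associated
with the correct target class. Using this simplified probability model, we obtain the
derivative of the log likelihood with respect to the weight vector $\theta_{klm}$ of node $m$ in
expert network $(k, l)$

$$\frac{\partial l(\theta; \mathcal{X})}{\partial \theta_{klm}} = \sum_t h_{kl}^{(t)} \left( t_m^{(t)} - p_{klm}^{(t)} \right) x^{(t)}$$

where $t_m^{(t)}$ is the target value of class $m$ for sample $t$, which leads to the following
expert network parameter update rule

$$\theta_{klm} \leftarrow \theta_{klm} + \eta \left( \sum_{t,\, c = m} h_{kl}^{(t)} \left( 1 - p_{klm}^{(t)} \right) x^{(t)} - \sum_{t,\, c \neq m} h_{kl}^{(t)} p_{klm}^{(t)} x^{(t)} \right)$$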

Again, the update formulas can be used either in on-line or in batch mode. We
will postpone the evaluation of the gradient ascent learning rule until after the next two
sections, where we will derive a more efficient learning algorithm for the HME architecture.


4.3  EM  Learning 

The Expectation Maximization (EM) algorithm of Dempster et al. [10] is a general
technique for maximum likelihood estimation. It is mainly applied to unsupervised learning,
i.e. clustering and mixture density estimation. The most popular application of EM to
unsupervised learning in the context of speech recognition is the Baum-Welch or
Forward-Backward algorithm that solves the learning problem for Hidden Markov Models. The
EM algorithm is a very powerful iterative algorithm for maximum likelihood problems
involving missing data. For example, in speech recognition, the Baum-Welch reestimation
usually converges in only 2-5 iterations. There is no reason why the EM framework
should not be applicable to supervised learning problems like HME learning as well.

4.3.1  General  EM  Algorithm 

The iterative EM algorithm is composed of two steps. The E-step (Expectation) defines
a new likelihood function in each iteration, which is maximized during the M-step
(Maximization). Often, E- and M-step are combined in a single indivisible algorithm, but
for theoretical purposes we will distinguish between the two steps. If the M-step only
increases the likelihood instead of maximizing it in each step, the algorithm is called
Generalized Expectation Maximization (GEM). The learning algorithm for the Boltzmann
machine, for example, is essentially a GEM algorithm.

In order to apply the EM algorithm to a new domain, a set of 'missing' or 'unknown'
variables has to be defined that would simplify the optimization of the log likelihood if
they were known. We then distinguish between the incomplete-data log likelihood $l(\theta; \mathcal{X})$
over the observable data $\mathcal{X}$ and the complete-data log likelihood $l_c(\theta; \mathcal{Y})$ over the extended
data $\mathcal{Y} = \mathcal{X} \cup \mathcal{Z}$, which includes the set of missing variables $\mathcal{Z}$. It is important to note that
the complete-data log likelihood is a random variable because the missing variables
are unknown.

The EM algorithm aims at increasing an estimate of the complete-data log likelihood
as follows. Using the observed data and the current model, the E-step first computes the
expected value of the complete-data log likelihood:

$$Q(\theta, \theta^{(k)}) = E\left[ l_c(\theta; \mathcal{Y}) \mid \mathcal{X} \right]$$

The superscript $k$ refers to the parameters at the $k$-th iteration of the algorithm. The
E-step yields a deterministic function $Q$ of the parameters of the model. The M-step
maximizes the Q-function with respect to the model's parameters:

$$\theta^{(k+1)} = \arg\max_{\theta} Q(\theta, \theta^{(k)})$$

The process iterates by looping over E- and M-step until the maximization yields no
further improvement. The EM algorithm is guaranteed to compute parameter estimates that
increase the Q-function in each iteration. The Q-function, however, is just the expected
value of the complete-data log likelihood. Our goal is to maximize the incomplete-data
log likelihood. Dempster et al. addressed this issue and proved that an increase in the
Q-function always implies an increase in the incomplete-data log likelihood:

$$Q(\theta^{(k+1)}, \theta^{(k)}) > Q(\theta^{(k)}, \theta^{(k)}) \;\Rightarrow\; l(\theta^{(k+1)}; \mathcal{X}) > l(\theta^{(k)}; \mathcal{X})$$

That means the original likelihood $l(\theta; \mathcal{X})$ increases monotonically with every
iteration, converging to a local maximum.
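
The monotonic increase of the incomplete-data log likelihood can be observed on a toy problem. The sketch below is not the HME algorithm; it assumes a much simpler unsupervised model, a mixture of two one-dimensional Gaussians with unit variance, chosen only because its M-step has a closed form. All names (`em_step`, `run_em`, and so on) are hypothetical.

```python
import math

def normal_pdf(x, mean):
    """Density of a unit-variance Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def log_likelihood(data, weight, mean_a, mean_b):
    """Incomplete-data log likelihood of the two-component mixture."""
    return sum(math.log(weight * normal_pdf(x, mean_a)
                        + (1.0 - weight) * normal_pdf(x, mean_b)) for x in data)

def em_step(data, weight, mean_a, mean_b):
    # E-step: posterior probability that component a generated each sample
    resp = []
    for x in data:
        pa = weight * normal_pdf(x, mean_a)
        pb = (1.0 - weight) * normal_pdf(x, mean_b)
        resp.append(pa / (pa + pb))
    # M-step: closed-form maximization of the Q-function
    total_a = sum(resp)
    total_b = len(data) - total_a
    new_weight = total_a / len(data)
    new_mean_a = sum(r * x for r, x in zip(resp, data)) / total_a
    new_mean_b = sum((1.0 - r) * x for r, x in zip(resp, data)) / total_b
    return new_weight, new_mean_a, new_mean_b

def run_em(data, iterations=10):
    """Run EM and record the log likelihood before and after every iteration."""
    params = (0.5, min(data), max(data))
    history = [log_likelihood(data, *params)]
    for _ in range(iterations):
        params = em_step(data, *params)
        history.append(log_likelihood(data, *params))
    return params, history
```

The recorded `history` never decreases from one iteration to the next, in line with the result of Dempster et al. cited above; which local maximum is reached depends on the initial parameters.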

4.3.2  Applying  EM  to  the  HME 

Application of EM to the HME architecture involves the definition of 'missing' variables
that facilitate the optimization of the log likelihood. Let $z_i$, $i = 1, \ldots, n$ be a set of binary
indicator variables for the top-level gating network, and let $z_{j|i}$, $i, j = 1, \ldots, n$ be a set of
binary indicator variables for the second layer gating networks. For any given input vector
$x$, exactly one of the $z_i$s is one; all the others are zero. Similarly, given the $z_i$, exactly
one of the $z_{j|i}$ is one; all the others are zero. The $z_i$s and $z_{j|i}$s have an interpretation as
the (unknown) decisions corresponding to the probability model. An instantiation of the
$z_i$s and $z_{j|i}$s corresponds to a specific path from the root node of the tree to one of the
leaves, determining the expert responsible for data generation. Note, however, that the
$z_i$s and $z_{j|i}$s are not known and must be treated as random variables. If they were known,
the maximum likelihood problem for the HME would decouple into a set of independent
maximum likelihood problems for each of the gating and expert networks. Although the
$z_i$s and $z_{j|i}$s are unknown, we can specify a complete-data log likelihood probability model
that links them to the observable data and allows for the application of the EM algorithm:

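$$l_c(\theta; \mathcal{Y}) = \log \prod_t \prod_i \prod_j \left( g_i^{(t)} g_{j|i}^{(t)} P_{ij}(y^{(t)}) \right)^{z_{ij}^{(t)}}$$

$$= \sum_t \sum_i \sum_j z_{ij}^{(t)} \left( \log g_i^{(t)} + \log g_{j|i}^{(t)} + \log P_{ij}(y^{(t)}) \right)$$

where $z_{ij}^{(t)} = z_i^{(t)} z_{j|i}^{(t)}$.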

The above complete-data log likelihood is much easier to maximize than the
corresponding incomplete-data log likelihood, because we managed to bring the logarithm
inside the summation.

One can prove easily that the posterior probabilities $h_i$, $h_{j|i}$ and $h_{ij}$ can be used as
the expected values for the unknown indicator variables $z_i$, $z_{j|i}$ and $z_{ij}$, respectively (see
[26] for a proof). Using this fact, we can define the Q-function for the E-step of the EM
algorithm:

$$Q(\theta, \theta^{(k)}) = \sum_t \sum_i \sum_j h_{ij}^{(t)} \left( \log g_i^{(t)} + \log g_{j|i}^{(t)} + \log P_{ij}(y^{(t)}) \right)$$
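The M-step requires the maximization of the Q-function with respect to the model's
parameters. We now see the benefits of applying EM, since the maximization decouples
into a set of separate maximum likelihood problems that may be solved independently
during the M-step:

$$v_i^{(k+1)} = \arg\max_{v_i} \sum_t \sum_l h_l^{(t)} \log g_l^{(t)}$$

$$v_{ij}^{(k+1)} = \arg\max_{v_{ij}} \sum_t \sum_l h_l^{(t)} \sum_m h_{m|l}^{(t)} \log g_{m|l}^{(t)}$$

$$\theta_{ij}^{(k+1)} = \arg\max_{\theta_{ij}} \sum_t h_{ij}^{(t)} \log P_{ij}(y^{(t)})$$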

Since we are mainly interested in the HME as a classifier, we will restrict the derivation
of solutions for the above maximum likelihood problems to this case, assuming a
multinomial (Poisson) density as the probability model for the expert as well as the gating
networks. Under these assumptions, the log likelihood equations for the expert and gating
network parameters are weighted log likelihoods for a special case of a Generalized Linear
Model (GLIM), namely a multinomial logit model. For the top-level gating networks,
we have to maximize the cross-entropy $\sum_i h_i \log g_i$ between the posterior branching
probabilities $h_i$ and the branching (prior) probabilities $g_i$ (the cross-entropy is taken
here with positive sign, so larger values mean better agreement). For the second-level
gating networks, we have to maximize the cross-entropy between the posterior branching
probabilities $h_{m|i}$ and the branching (prior) probabilities $g_{m|i}$, weighted by the posterior
probability $h_i$ of the gating node itself. In deeper trees, the weight for the cross-entropy
is simply the product of posterior branching probabilities along the path from the root
node to the gating node in question. Finally, the maximization problem for the expert
networks involves maximizing the cross-entropy between the expert's posterior probability
and the output at the node of the actual correct class. Since all of the above maximization
problems are based on likelihoods for generalized linear models, we can apply an algorithm
called Iteratively Reweighted Least Squares (IRLS) [34] that solves such likelihood problems.




4.3.3  Iteratively  Reweighted  Least  Squares  (IRLS) 

Applying the EM algorithm to the HME architecture requires the computation of posterior
probabilities $h_i$, $h_{j|i}$ and $h_{ij}$ for each input vector $x$ in the E-step, and the maximization
of independent maximum likelihood problems for GLIMs in the M-step. This process is
iteratively repeated until no further improvement can be obtained. This section describes
the IRLS algorithm that can be used to solve the maximization problems within the
M-step. The IRLS algorithm is a special case of the Fisher scoring method [12]. In order
to maximize the log likelihood $l(\beta; \mathcal{X})$ with respect to the parameter vector $\beta$, the Fisher
scoring method updates $\beta$ according to

$$\beta^{(k+1)} = \beta^{(k)} - \left\{ E\left[ \frac{\partial^2 l(\beta; \mathcal{X})}{\partial \beta \, \partial \beta^T} \right] \right\}^{-1} \frac{\partial l(\beta; \mathcal{X})}{\partial \beta}$$

This equation strongly resembles the Newton-Raphson equation, with the notable
difference that in the Fisher scoring method, the Hessian is replaced by the expected value
of the Hessian. Besides the fact that the expected value of the Hessian is often easier to
compute, there are statistical reasons for preferring it over the actual Hessian.
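
To make the Fisher scoring update tangible, here is a minimal sketch for the simplest possible GLIM: a Bernoulli model with a single scalar parameter, $p = \sigma(\beta x)$. This is a deliberately reduced assumption (the thesis deals with multinomial models and weight vectors), and the names `sigmoid` and `fisher_scoring_logistic` are hypothetical. With the canonical sigmoid link, the expected and the actual Hessian coincide, so in this special case the iteration is identical to Newton-Raphson.

```python
import math

def sigmoid(a):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-a))

def fisher_scoring_logistic(xs, ys, iterations=20):
    """Fit p = sigmoid(beta * x) to binary targets ys by Fisher scoring."""
    beta = 0.0
    for _ in range(iterations):
        ps = [sigmoid(beta * x) for x in xs]
        # first derivative of the log likelihood (the score)
        score = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # negative expected second derivative (the Fisher information)
        info = sum(x * x * p * (1.0 - p) for x, p in zip(xs, ps))
        beta += score / info
    return beta
```

The factor `p * (1 - p)` plays the role of the iteration-dependent weights that give IRLS its name. The sketch assumes data that are not linearly separable; otherwise the maximum likelihood estimate does not exist and the iteration diverges.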

We will now derive the IRLS algorithm for the special case of a multinomial GLIM.
The multinomial density is a member of the exponential family of distributions, which is
an important class of distributions in statistics. It can be rewritten in the following form:

$$P(y_1, \ldots, y_n) = \exp\left\{ \log \frac{m!}{y_1! \cdots y_n!} + \sum_{i=1}^{n-1} y_i \log \frac{p_i}{p_n} + m \log p_n \right\}$$

where we have used the constraint that the $p_i$ sum up to one to express $p_n$ as
$p_n = 1 - \sum_{i=1}^{n-1} p_i$. Comparing this form of the multinomial density with the general form of a
density of the exponential family

$$P(y, \eta, \phi) = \exp\left\{ \frac{\eta^T y - b(\eta)}{\phi} + c(y, \phi) \right\}$$

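with the natural parameter $\eta$ and the dispersion parameter $\phi$, we can define the natural
parameter $\eta$ to be the vector of $\eta_i$s:

$$\eta_i = \log \frac{p_i}{p_n} = \log \frac{p_i}{1 - \sum_{j=1}^{n-1} p_j} = \log \left[ p_i \left( 1 + \sum_{j=1}^{n-1} \exp(\eta_j) \right) \right]$$

This equation can be inverted to yield

$$p_i = \frac{\exp(\eta_i)}{1 + \sum_{j=1}^{n-1} \exp(\eta_j)} = \frac{\exp(\eta_i)}{\sum_{j=1}^{n} \exp(\eta_j)}$$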

which is the 'softmax' function that we have assumed as the non-linearity for the gating
and expert networks. By parameterizing the multinomial probability density in terms of
the natural parameter $\eta$, we have forced the choice of the network's output non-linearity
to be the softmax function. The softmax function is referred to as the canonical link to
the multinomial distribution. Other choices of the output probability density result in
different canonical links; for example, assuming a Bernoulli density yields the standard
sigmoid function as the canonical link function.
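
The inversion from natural parameters back to probabilities is easy to check numerically. The following sketch assumes the last class serves as the reference class, so that its natural parameter is zero; the function names are hypothetical.

```python
import math

def natural_parameters(p):
    """eta_i = log(p_i / p_n) for i = 1..n-1, with the last class as reference."""
    return [math.log(pi / p[-1]) for pi in p[:-1]]

def probabilities(eta):
    """Softmax inversion: p_i = exp(eta_i) / (1 + sum_j exp(eta_j)).

    The reference class contributes exp(0) = 1 to the normalization.
    """
    exps = [math.exp(e) for e in eta] + [1.0]
    total = sum(exps)
    return [e / total for e in exps]
```

Applying `probabilities` to the output of `natural_parameters` recovers the original probability vector up to rounding error, which is the round trip expressed by the two equations above.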

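Having justified the choice of the output non-linearity, we now proceed with the derivation
of the IRLS update equations. First we define the function $b$ implicitly as the integral of
the softmax function:

$$\frac{\partial b(\eta^{(t)})}{\partial \eta_i^{(t)}} = \frac{\exp(\eta_i^{(t)})}{\sum_j \exp(\eta_j^{(t)})} \qquad \text{with } \eta_i^{(t)} = \beta_i^T x^{(t)}$$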


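We can now compute the terms necessary for the Fisher scoring equation, that is, we
need the likelihood and the first and second derivatives of the likelihood of a multinomial
GLIM:

$$l(\beta; \mathcal{X}) = \sum_t \left[ \sum_k y_k^{(t)} \beta_k^T x^{(t)} - b(\eta^{(t)}) \right] + \sum_t \log \frac{m!}{y_1^{(t)}! \cdots y_n^{(t)}!}$$

$$\frac{\partial l(\beta; \mathcal{X})}{\partial \beta_i} = \sum_t \sum_k \left( y_k^{(t)} - p_k^{(t)} \right) \frac{\partial \eta_k^{(t)}}{\partial \beta_i} = \sum_t \left( y_i^{(t)} - p_i^{(t)} \right) x^{(t)}$$

$$\frac{\partial^2 l(\beta; \mathcal{X})}{\partial \beta_i \, \partial \beta_j^T} = -\sum_t p_i^{(t)} \left( \delta_{ij} - p_j^{(t)} \right) x^{(t)} x^{(t)T}$$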

Finally, by assembling all these equations into the Fisher scoring update function, we
obtain the following IRLS algorithm for multinomial GLIMs:

$$\beta_i^{(k+1)} = \beta_i^{(k)} + \left( X^T W_i X \right)^{-1} X^T W_i e_i$$

where $X$ is the matrix whose rows are the input vectors and $W_i$ is a diagonal matrix
with diagonal elements

$$w_i^{(t)} = p_i^{(t)} \left( 1 - p_i^{(t)} \right)$$

and $e_i$ is the vector of scalars $e_i^{(t)}$:

$$e_i^{(t)} = \frac{y_i^{(t)} - p_i^{(t)}}{p_i^{(t)} \left( 1 - p_i^{(t)} \right)}$$

The weight matrices $W_i$ and the vectors $e_i$ change from iteration to iteration because
they depend on the weight vectors $\beta_i$. The above update equation is essentially a solution
to a weighted least squares problem. In our case, we need to extend the IRLS algorithm
because we have additional fixed observation weights imposed by the gating networks.
This can easily be done by multiplying the fixed observation weights with the iteratively
varying weight matrices $W_i$, which leads to an iteratively reweighted weighted least squares
algorithm. Applying this algorithm to the HME architecture yields the following training
method:

1. Expectation Step:

Compute posterior branching/node probabilities $h_i^{(t)}$, $h_{j|i}^{(t)}$ and $h_{ij}^{(t)}$ for each data pair
$(x^{(t)}, y^{(t)})$ of the training set.

2. Maximization Step:

(a) Inner loop for experts:

For each expert network, solve an IRLS problem with observations $(x^{(t)}, y^{(t)})$
and observation weights $h_{ij}^{(t)}$.

(b) Inner loop for top-level gates:

For each top-level gating network, solve an IRLS problem with observations
$(x^{(t)}, h_i^{(t)})$.

(c) Inner loop for second-level gates:

For each second-level gating network, solve a weighted IRLS problem with
observations $(x^{(t)}, h_{j|i}^{(t)})$ and observation weights $h_i^{(t)}$.

3. Iterate EM steps using the updated parameter values.

This EM algorithm, though quite effective, needs an iterative procedure in the
M-step, during which the posterior probabilities need to be stored temporarily. This is not
feasible when dealing with large data sets, as is the case in speech recognition. Therefore,
we are interested in a version of the EM algorithm that allows the maximization steps to
be solved in one pass. There are two ways of achieving this. The first one is to relax the
constraint of maximization in the M-step and derive a Generalized EM algorithm (GEM)
that only guarantees to increase the log likelihoods during the M-step. The other way is to
use least squares fitting instead of likelihood maximization, together with heuristics, to
derive a practically useful learning algorithm, which we will do in the next section.


4.4  Least  Squares  and  Heuristics 

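Recall the three maximization problems derived from the Q-function, which we want
to solve in a one-pass algorithm:

$$v_i^{(k+1)} = \arg\max_{v_i} \sum_t \sum_l h_l^{(t)} \log g_l^{(t)}$$

$$v_{ij}^{(k+1)} = \arg\max_{v_{ij}} \sum_t \sum_l h_l^{(t)} \sum_m h_{m|l}^{(t)} \log g_{m|l}^{(t)}$$

$$\theta_{ij}^{(k+1)} = \arg\max_{\theta_{ij}} \sum_t h_{ij}^{(t)} \log P_{ij}(y^{(t)})$$

Computing the derivatives of the log likelihoods with respect to the parameters $v_i$,
$v_{ij}$ and $\theta_{ijk}$, respectively, and setting them to zero yields:

$$\sum_t \left( h_i^{(t)} - g_i^{(t)} \right) x^{(t)} = 0$$

$$\sum_t h_i^{(t)} \left( h_{j|i}^{(t)} - g_{j|i}^{(t)} \right) x^{(t)} = 0$$

$$\sum_t h_{ij}^{(t)} \left( t_k^{(t)} - p_{ijk}^{(t)} \right) x^{(t)} = 0$$

where $t_k^{(t)}$ is the target value of class $k$ for sample $t$ and $p_{ijk}^{(t)}$ is the corresponding
output of expert $(i, j)$.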


In the above equations, one can think of the posteriors as being targets for the gating
and expert network outputs. As mentioned before, the posteriors are estimates of the
unknown indicator variables, which would be the correct targets if they were known. By
inverting the softmax non-linearity at the outputs of gating and expert networks, we can
compute targets for the linear predictors which, in turn, can be used for standard least
squares fitting. Inverting the softmax function

$$y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} \quad \text{yields} \quad x_i = \log y_i + \log \sum_j \exp(x_j) = \log y_i + C$$

The second term is constant for all $x_i$, and constant terms common to all $x_i$s disappear
when the softmax function is applied. Therefore, we can use the $\log y_i$s as targets for the
linear predictors. In the case of the gating networks we obtain the following one-pass least
squares solutions to the maximization problem:


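$$v_i = \left( X^T X \right)^{-1} X^T e$$

$$v_{ij} = \left( X^T W X \right)^{-1} X^T W f$$

with $e = (\ldots, \log h_i^{(t)}, \ldots)^T$, $f = (\ldots, \log h_{j|i}^{(t)}, \ldots)^T$ and $W = \mathrm{diag}(\ldots, h_i^{(t)}, \ldots)$,
each running over all training samples $t$.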
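
The one-pass least squares solution can be illustrated for the smallest non-trivial case. The sketch below assumes a single input feature plus a bias term, so that $X^T W X$ is a 2x2 matrix that can be inverted in closed form; the log posteriors serve as targets and the observation weights are passed in explicitly. The function names and this restriction to one feature are assumptions made for illustration only.

```python
import math

def log_targets(posteriors):
    """Targets for the linear predictor: the logarithms of the posteriors."""
    return [math.log(h) for h in posteriors]

def weighted_least_squares(xs, targets, weights):
    """Solve (X^T W X) v = X^T W e for design rows (1, x_t).

    Returns (bias, slope). All sums can be accumulated in a single pass
    over the training data.
    """
    s_w = sum(weights)
    s_wx = sum(w * x for w, x in zip(weights, xs))
    s_wxx = sum(w * x * x for w, x in zip(weights, xs))
    s_we = sum(w * e for w, e in zip(weights, targets))
    s_wxe = sum(w * x * e for w, x, e in zip(weights, xs, targets))
    det = s_w * s_wxx - s_wx * s_wx          # determinant of X^T W X
    bias = (s_wxx * s_we - s_wx * s_wxe) / det
    slope = (s_w * s_wxe - s_wx * s_we) / det
    return bias, slope
```

If the log posteriors happen to be an exact linear function of the input, the solution recovers that function regardless of the observation weights. With uniform weights the same routine gives the unweighted solution used for the top-level gate.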

However, when trying to compute targets for the linear predictors of the expert networks,
we face the problem of having to compute the log of zero, since all but one coefficient of the
target vectors are zero. The heuristic here is to use targets $t_i$ out of $\{\epsilon, 1\}$ instead of the
usual $\{0, 1\}$. In practice, the value of $\epsilon$ is subject to optimization, but small values around
1e-3 have proven to work well. Thus, the least squares problem for expert networks is
solvable as before:

$$\theta_{ijk} = \left( X^T W X \right)^{-1} X^T W e$$

with $e = (\ldots, \log t_k^{(t)}, \ldots)^T$ and $W = \mathrm{diag}(\ldots, h_{ij}^{(t)}, \ldots)$.

Using standard (weighted) least squares, we were able to derive an effective EM algorithm
with a one-pass M-step, suitable for large hierarchies and large data sets. During training,
we have to compute posterior probabilities and accumulate the weighted input vectors into
the least squares matrices and vectors. After one iteration, a single matrix inversion for
each expert/gating network and a matrix-vector multiplication yield new parameter estimates.

In the remainder of this chapter, we will evaluate the EM algorithm and the gradient ascent
algorithm in terms of accuracy, generalization and convergence speed on a relatively small
task. We will also compare the HME with a multi-layer perceptron (MLP) trained by
error-backpropagation. The integration of HMEs into a hybrid speech recognition framework
will be evaluated later in a separate chapter.
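
The target heuristic for the expert networks described in this section can be sketched as follows. The code assumes hard class labels given as integer indices and a default of 1e-3 for the small target value; the function names are hypothetical.

```python
import math

def expert_log_targets(label, num_classes, epsilon=1e-3):
    """Log targets for the linear predictors of one expert.

    The correct class gets log(1) = 0, every other class gets log(epsilon)
    instead of the undefined log(0).
    """
    return [0.0 if k == label else math.log(epsilon) for k in range(num_classes)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Passing these log targets through the softmax gives a distribution that is close to, but not exactly, the 1-out-of-N encoding: the correct class receives probability 1 / (1 + (N - 1) * epsilon).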


4.5  HME  for  Vowel  Classification 

We will demonstrate the properties of the HME architecture and its learning algorithms
on Peterson and Barney's vowel classification data set [42]. We chose this data set because
it is non-artificial, speech recognition related and relatively small, which allows us to
explore and analyze the space of learning parameters. Another advantage of this data set
is its low dimensionality. We can easily reduce the originally four-dimensional feature
vectors to two-dimensional feature vectors, which allows us to draw certain properties of
the classifiers in a two-dimensional coordinate system. We think that this kind of analysis
provides deeper insight into and a better understanding of the way the HME works.

4.5.1  The  Data  Set 

The data set consists of 1520 four-dimensional feature vectors. The feature coefficients
are the formant frequencies F0, F1, F2 and F3. The data set contains an equal number
of training vectors for each of the following 10 American English vowels (uniform prior
distribution).


IY  IH  EH  AE  AH  AA  AO  UH  UW  ER

We did not preprocess the data in any way, except that we normalized each of the
four formant frequencies in the data set independently to the range [0, 1]. Fig. 4.3 shows
the complete data set in the normalized (F1, F2) feature space.
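
The independent normalization of each feature to [0, 1] can be sketched as a simple min-max scaling. The thesis does not spell out the exact procedure, so this is an assumption; the function name is hypothetical and the code presumes that every column contains at least two distinct values.

```python
def normalize_columns(rows):
    """Scale every column of a list of feature vectors independently to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
            for row in rows]
```

Each feature is shifted by its minimum and divided by its range over the whole data set, so the smallest value of every column maps to 0 and the largest to 1.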

Figure 4.3: Peterson & Barney's vowel classification data set (x = F1, y = F2)


Syrdal  and  Gopal  [50]  performed  classification  on  this  dataset  using  a  quantitative 
perceptual  model  of  human  vowel  recognition.  They  reported  classification  rates  between 
82.3%  and  85.9%  for  their  classifier  based  on  bark  scale  differences  and  linear  discriminant 
analysis  (LDA).  Human  listeners  achieved  an  average  classification  rate  of  94.4%  when 
hearing  the  original  recordings  of  the  vowels. 

4.5.2  Results 

Fig. 4.4 shows the evolution of the likelihood on the training data and the mean square
error and the classification error on the test data for a GLIM, an MLP and different HME
architectures (branching factor 2, depth 1, 2 and 3). The HMEs were trained with a
combination of the Least Squares (LS) heuristic to EM and the gradient ascent (GA)
algorithm. We found that the Least Squares heuristic converges very fast (faster than the
gradient based training) but is not able to achieve the same performance.

Figure 4.4: Typical training runs on Peterson & Barney's vowel data

Therefore, we used LS for the first few iterations before switching to GA, which gave
the best results. The MLP was trained with on-line stochastic gradient error
backpropagation with a learning rate of 0.1 (optimized by several trials). The training runs were
performed on 4-dimensional feature vectors. Comparing classifier performances with
respect to the classification error rate, one can see that a simple GLIM is competitive with
both a 2-layer MLP with 24 hidden units and the different HME architectures. However,
the evolution of the likelihood and mean square error shows that MLP and HMEs are able
to learn the data better. Several things deserve to be mentioned:

•  MLP and HMEs achieve roughly the same performance.

•  Convergence is much faster for the HMEs due to the EM algorithm.

•  Different HME architectures do not vary significantly in the case of the vowel data.

Fig. 4.5 shows the class boundaries imposed on a 2-dimensional feature space (F1, F2)
by an HME (depth 3, branching factor 2) and an MLP (24 hidden units), respectively.


Figure 4.5: Class boundaries obtained by HME (left) and MLP (right)


HME and MLP were trained until convergence on the 2-dimensional features. The
plots in Fig. 4.5 were computed by sampling the interval $[0, 1]^2$, coloring the class with
the highest output activation in different shades of gray. The MLP seems to prefer
non-linear, curvy class boundaries, whereas the HME imposes almost linear ones. It seems
that the HME discovers that the task does not need a soft collaboration between experts,
therefore partitioning the input space into disjoint segments, which are classified by the
(generalized) linear experts.
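
The way such decision-region plots are produced can be sketched as follows. The code assumes a classifier given as a function that maps a point of the unit square to a list of output activations; the function name `decision_map`, the grid resolution and the classifier interface are hypothetical.

```python
def decision_map(classify, steps=50):
    """Label every point of a regular grid over [0, 1]^2 with the winning class.

    classify: function taking a point (x, y) and returning a list of activations
    Returns a list of rows; each entry is the index of the largest activation
    (the first such index in case of ties).
    """
    grid = []
    for row in range(steps):
        y = row / (steps - 1)
        line = []
        for col in range(steps):
            x = col / (steps - 1)
            outputs = classify((x, y))
            line.append(max(range(len(outputs)), key=outputs.__getitem__))
        grid.append(line)
    return grid
```

Mapping each class index to a shade of gray turns the returned grid into a picture of the class boundaries. The same routine can be applied to gating probabilities instead of class outputs to visualize which expert dominates in which region.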

Fig. 4.6 shows the evolution of the activation regions of the experts while training the
architecture. The plots are sampled in the same region $[0, 1]^2$ as before, coloring the expert
with the highest cumulative gating probability in different shades of gray. As the training
proceeds, the HME evidently shuts off 5 of its 8 experts completely. A combination
of 3 experts seems to be enough to solve the given task. This again means that a lot of
parameters in the HME tree are rendered useless in this specific application.


Figure 4.6: Evolution of the experts' regions of activation (after 1, 2, 3, 4 and 9 iterations,
respectively)


Since we do not know in advance how many experts are sufficient to solve a given
problem adequately, we can only guess and use an architecture that is likely to contain
more experts than needed. This approach to model selection is clearly a waste of
parameters. The next chapter addresses this problem by presenting a constructive method
which iteratively grows an HME architecture that uses its parameters more effectively.

Chapter 5

Constructive  Methods 


5.1  Motivation 

One of the essential problems with the HME approach, as with other neural architectures,
is model selection. Applying HMEs to a classification or regression problem requires the
choice of structural parameters such as the tree depth and the branching factor. As with
other architectures, the problem of model selection is mostly solved in a rather simple
way: architectures of different sizes are trained and their performances are compared
on an independent test set to select the one that generalizes best. This approach is
computationally very expensive, especially when dealing with large data sets.

Better solutions for selecting model sizes are constructive and/or pruning methods.
Constructive methods iteratively generate larger models starting from a very small one.
For example, Fahlman's cascade correlation algorithm realizes such a constructive method
for a special multi-layered network. The basic idea in all growing algorithms is to use
some criterion on the training data to select the locally best expansion out of the set of
all possible expansions, and thereby to adaptively generate an architecture that fits the
data better than its static counterpart.

Pruning methods, on the other hand, use the opposite strategy: a large (possibly
oversized) architecture is evaluated to detect obsolete or ineffective parts, which are
then removed before the architecture is re-trained. This process can also be repeated
iteratively, using the performance on an independent test set as the stopping criterion.
Computationally, pruning methods have the disadvantage of repeatedly requiring the
training of unnecessarily large architectures.

Because of the inherent tree structure of the HME, it is very appealing to derive a
growing algorithm for this architecture. The machine learning literature offers a wide
variety of growing algorithms for classification and decision trees [44], [45], [6].
Unfortunately, these algorithms require the evaluation of the gain of all possible node
splits, using (mostly) entropy or likelihood based criteria, to eventually realize the
best split and discard all the others. Waterhouse and Robinson [56] presented such an
algorithm for the HME architecture. They evaluated their growing algorithm on a relatively
small data set. In the case of very large speech data sets, their approach is no longer
applicable in a reasonable amount of time. We therefore developed a different growing
algorithm for the HME architecture which imposes very little overhead and which is
applicable in our domain.


5.2  Algorithms 

We distinguish between tree growing and tree pruning, although both techniques are
usually applied simultaneously in order to achieve faster learning and recognition passes.

5.2.1  Adaptive  Tree  Growing 

In order to grow an HME, we have to define an evaluation criterion to score the experts'
performance on the training data, which in turn will allow us to select the worst expert
to be split into a new subtree, providing additional parameters which can help to overcome
the errors made by this expert.

Viewing the HME as a probabilistic model of the observed data, we partition the input
dependent likelihood of data generation using the expert selection probabilities provided
by the gating networks:

$$
\begin{aligned}
l(\Theta; X) &= \sum_t \log P(y^{(t)} \mid x^{(t)}, \Theta)
              = \sum_t \sum_k g_k^{(t)} \log P(y^{(t)} \mid x^{(t)}, \Theta) \\
             &= \sum_k \sum_t g_k^{(t)} \log P(y^{(t)} \mid x^{(t)}, \Theta)
              = \sum_k l_k(\Theta; X)
\end{aligned}
$$

where the $g_k$ are the products of the gating probabilities along the path from the root
node to the k-th expert, that is, $g_k$ is the probability that expert k is responsible for
generating the observed data (note that the $g_k$ sum up to one). The expert-dependent
scaled likelihoods $l_k(\Theta; X)$ can be used as a measure for the performance of an
expert within its region of responsibility. We use this measure as the basis of our tree
growing algorithm:

1. Initialize and train a simple HME consisting of only one gate and several experts.

2. Compute the expert-dependent scaled likelihoods $l_k(\Theta; X)$ for each expert in one
additional pass through the training data.

3. Find the expert k with minimum $l_k$ and expand the tree, replacing the expert by a
new gate with random weights and new experts that copy the weights from the old
expert with additional small random perturbations.


4. Train the architecture to a local minimum of the classification error using a
cross-validation set.

5. Continue with step (2) until the desired tree size is reached.

The number of tree growing phases may either be pre-determined or based on the
difference in the likelihoods before and after splitting a node. In contrast to the growing
algorithm in [56], our algorithm does not hypothesize all possible node splits, but
determines the expansion node(s) directly, which is much faster, especially when dealing
with large hierarchies.
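
Steps 2 and 3 of the growing algorithm can be sketched as follows. This is a minimal
sketch, assuming that the scaled likelihood of expert k is the sum over frames of the
expert's path probability times the log-likelihood of the whole model for that frame; the
function names and the toy numbers are hypothetical and not taken from the original
implementation.

```python
import math


def scaled_log_likelihoods(gate_probs, frame_likelihoods):
    """l_k = sum_t g_k(t) * log P(y_t | x_t, Theta) for every expert k.

    gate_probs[t][k]     -- path probability g_k of expert k for frame t
    frame_likelihoods[t] -- likelihood of frame t under the whole HME
    """
    n_experts = len(gate_probs[0])
    scores = [0.0] * n_experts
    for g_t, p_t in zip(gate_probs, frame_likelihoods):
        log_p = math.log(p_t)
        for k in range(n_experts):
            scores[k] += g_t[k] * log_p
    return scores


def select_expert_to_split(scores):
    """Step 3: the expert with the minimum scaled likelihood is expanded."""
    return min(range(len(scores)), key=scores.__getitem__)


# Toy data (made up for illustration): 3 frames, 2 experts.
gate_probs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
frame_likelihoods = [0.8, 0.3, 0.2]
scores = scaled_log_likelihoods(gate_probs, frame_likelihoods)
worst = select_expert_to_split(scores)
```

Because the path probabilities of each frame sum to one, the per-expert scores add up
to the total log-likelihood, so a single extra pass over the training data is enough to
rank all experts. An expert that is rarely active accumulates a score close to zero and
is therefore not selected.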

5.2.2  Pruning 

Furthermore, we implemented a path pruning technique similar to the one proposed in
[56], which speeds up training and testing times significantly. During the recursive
depth-first traversal of the tree (needed for forward evaluation, posterior probability
computation and accumulation of node statistics) a path is pruned temporarily if the
current node's probability of activation falls below a certain threshold. Additionally, we
also prune subtrees permanently if the sum of a node's activation probabilities over the
whole training set falls below a certain threshold. This technique is consistent with the
growing algorithm and helps prevent instabilities and singularities in the parameter
updates, since nodes that accumulate too little training information will be pruned away
without being considered for a parameter update.
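
Temporary path pruning during the depth-first forward pass can be sketched as follows.
This is an illustrative sketch only: the dictionary-based tree layout, the function name
`forward` and the toy gate and expert functions are hypothetical. It also assumes that a
node's probability of activation is the product of the gating probabilities along its
path, and that the most probable child of every gate is always evaluated so that the
output stays defined for any threshold.

```python
def forward(node, x, threshold, path_prob=1.0, visited=None):
    """Depth-first forward pass that skips weakly activated subtrees."""
    if "expert" in node:  # leaf node: weight the expert output by its activation
        if visited is not None:
            visited.append(node["name"])
        return [path_prob * p for p in node["expert"](x)]
    gate = node["gate"](x)
    best = max(range(len(gate)), key=gate.__getitem__)
    total = None
    for i, child in enumerate(node["children"]):
        prob = path_prob * gate[i]
        if i != best and prob < threshold:
            continue  # temporary pruning: this subtree is not evaluated for x
        out = forward(child, x, threshold, prob, visited)
        total = out if total is None else [a + b for a, b in zip(total, out)]
    return total


# Toy tree (made up): one gate, two experts with constant outputs.
tree = {
    "gate": lambda x: [0.95, 0.05],
    "children": [
        {"name": "e0", "expert": lambda x: [1.0, 0.0]},
        {"name": "e1", "expert": lambda x: [0.0, 1.0]},
    ],
}
full = forward(tree, None, threshold=0.0)
seen = []
pruned = forward(tree, None, threshold=0.1, visited=seen)
```

With a threshold of 0.0 both experts are evaluated; with a threshold of 0.1 the second
expert is skipped, which saves computation at the cost of a slightly different output.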

Temporarily pruning branches of the HME tree can speed up training and testing
times considerably, although this will most likely lead to an increase in error rate. We
will present results of experiments with different pruning thresholds and their impact
on the performance of an HME system. For speech recognition applications, a means
of trading off accuracy against speed is very appealing, especially for demo systems,
where the system's reaction time is more important than its accuracy (although an
improvement in both directions is desirable, of course). We will therefore also examine
the effect of HME pruning on speech recognition performance.


5.3  Experiments 

We evaluate the tree growing and pruning algorithms on the Peterson & Barney vowel
classification task, comparing the resulting HMEs with standard pre-determined HME
architectures.

5.3.1  Tree  Growing 

We compare a standard binary tree HME (depth 3) containing 8 experts with an adaptively
grown binary HME with the same number of experts. Fig. 5.1 and Fig. 5.2 show
the evolution of the classification rate and log-likelihood during training. The standard
HME achieves its final performance after 9 iterations; the growing HME achieves
the same performance after 8 iterations, at which point it consists of only 3 experts.
This is consistent with our earlier observations.

Figure 5.1: Classification rate for standard and growing HME

The bumpiness of the curves for the growing HME is due to the node splitting, which
was done after every 4 iterations. Each time a node is split, two new experts are
introduced and initialized with the splitting candidate's parameters plus small additional
random perturbations. This causes an initial decrease in both classification rate and
log-likelihood, which is soon compensated by the power of the additional parameters.

One of the motivations for the growing algorithm was the desire to use the available
parameters effectively. Fig. 5.3 and Fig. 5.4 compare the two architectures in this respect.
They show the final topologies together with histograms at each internal node,
approximating the distributions of gating probabilities over the test set. The histogram
trees should be interpreted as follows:

•  A  sharp  peak  at  the  left  or  right  side  of  a  histogram  indicates  that  one  of  the  two 
children  nodes  is  shut  off  by  the  corresponding  gate. 

•  Peaks both at the left and the right side of a histogram indicate a more or less hard
split of the input space by the corresponding gate.

•  A  peak  in  the  middle  of  the  histogram  indicates  that  the  corresponding  gate  makes 
use  of  soft  splits  of  the  input  space. 


Figure 5.2: Log-likelihood for standard and growing HME

As  one  can  see  in  Fig.  5.3,  only  4  of  the  8  experts  can  contribute  to  the  overall  output 
of  the  hierarchy,  the  remaining  4  experts  are  ’pinched-off’  almost  completely. 

Fig. 5.4 shows the same histogram tree for the grown architecture. Here, almost all
experts contribute to the overall output. The criterion for splitting nodes during the
growing phase implicitly guarantees this, because the splitting score is weighted by the
expert's activation. An expert that is hardly ever active will never be split into a new
subtree, which is exactly what we want.

Fig. 5.5 and Fig. 5.6 compare the regions of activation for each of the 8 experts in
both architectures. Each plot was obtained by sampling the expert's activation (product
of gating probabilities along the path from root to expert node) in the region [0, 1]^2.
White indicates high activation, whereas black indicates low activation.

5.3.2  Pruning 

Fig. 5.7 shows the effect of different pruning factors during training on the final
classification performance. In this experiment we chose the 2-dimensional feature space,
consisting of F1 and F2, because the difference between a GLIM and an HME in terms of
classification performance is much more obvious there. The HME consists of 8 experts,
organized in a binary tree of depth 3. A pruning value of 0.0 corresponds to no pruning at
all, while at a value of almost 1.0 only the most probable expert is evaluated.

Figure 5.5: Expert activations for standard HME


Figure 5.6: Expert activations for grown HME


Since the test set is relatively small, measuring the classification error after only one
training run is not very representative, because different initial weights influence the
final performance. Therefore, we computed the mean and standard deviation of the
classification error rate over 20 training runs for each setting of the pruning factor. The
lowest classification error rate over a maximum of 30 iterations was computed and used in
each training run, although most of the training runs converged in less than 8 iterations.
Finally, Fig. 5.8 shows the impact of pruning during the testing of an HME. This time, the
HME was trained without pruning. Different pruning thresholds were applied during the
computation of the mean square error on the test set. We chose the MSE instead of the
classification rate, since the test set is too small to give significant results with respect
to the classification error rate (and because GLIM and HME performances are relatively


Chapter  6 

Context  Modeling 


It is well known from traditional HMM based speech recognizers that the modeling of
phonetic context improves recognition accuracy significantly over context-independent
monophone models. Incorporating context models into a connectionist hybrid HMM system
is also expected to boost performance, but it requires a different approach, since
the computation of class likelihoods is not distributed among separate estimators, but
is performed by computing class posteriors using one big neural network. This chapter
introduces posterior factoring as a technique to model phonetic contexts within a hybrid
connectionist speech recognizer and presents a parametric clustering algorithm that
creates decision tree clustered polyphone contexts.


6.1  Phonetic  Context  Modeling 

In a system with n monophones, modeling of context windows of width d would require
the estimation of models for $n^d$ classes, which is not feasible in practice
($n \approx 50$, $d \geq 3$). Usually, phonetic contexts are hierarchically clustered
according to a distance measure between two parametric distributions. The most popular
example is generalized triphones [32]. Systems that use this kind of modeling cluster the
set of all possible/observed monophone triples (≈ 125000) into a set of about 5000 to
10000 models. This approach, however, considers only the left and right neighbors of a
monophone. More recently, systems have emerged that cluster broader contexts, so called
polyphones. Whatever the actual context modeling is, once a set of reasonable context
classes is computed, it remains to estimate likelihoods for each of these classes.

A mixture of Gaussians based context-independent (CI) HMM system can be augmented
to a context-dependent (CD) one fairly simply, since each class is modeled by
a separate multivariate Gaussian mixture and density estimation of one context class is
independent of all the other classes. As far as the acoustic modeling is concerned, it only
requires a much larger set of mixture densities; the underlying mathematical framework
does not restrict the number of modeled classes.


When augmenting a CI connectionist hybrid HMM system to model context classes, we
face some difficulties, since scaled class likelihoods are computed from class posteriors,
which in turn are computed by one single neural network. This works well for a CI system
with only about 50 classes, but it is computationally not feasible to model a set of over
1000 context classes by one single neural network, which would require over 1000 output
neurons. Also, such a network would compute posteriors for all of the context classes in
each frame, although most of them will never be used by the decoder. Training such a
big network is potentially troublesome and would require too many training epochs to be
applicable to speech domains with large training datasets.


6.2  Factoring  Posteriors 

Fortunately, posteriors for context dependent classes can be modeled by multiple neural
networks, each of which contains only a small number of output neurons. Using Bayes'
rule and standard rules for conditional probabilities, the context-dependent monophone
likelihood $p(\mathbf{x} \mid c_j, \omega_i)$ for monophone $\omega_i$ and context class
$c_j$, which is required by the HMM, can be factored into separate terms, depending on the
state topology.


6.2.1  Single  State  Topologies 

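In a system where each context class is modeled by a single HMM state, the emission
probability (likelihood) to be estimated in each frame is
$p(\mathbf{x} \mid c_j, \omega_i)$. Using Bayes' rule, this is equal to

$$
p(\mathbf{x} \mid c_j, \omega_i) = \frac{p(c_j, \omega_i \mid \mathbf{x})\, p(\mathbf{x})}{P(c_j, \omega_i)}
$$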


The above equation can be factored as follows using the standard rule for conditional
probabilities:

$$
p(\mathbf{x} \mid c_j, \omega_i) =
\frac{p(c_j \mid \omega_i, \mathbf{x})}{P(c_j \mid \omega_i)} \cdot
\frac{p(\omega_i \mid \mathbf{x})}{P(\omega_i)} \cdot p(\mathbf{x})
$$

As usual, $p(\mathbf{x})$ can be neglected since it is equal for all context classes $c_j$
and all monophones $\omega_i$ given a particular frame $\mathbf{x}$. It will therefore not
affect the decisions made in the decoder: a common factor becomes an additive constant in
the log domain, and the < relation is invariant to the addition of constants.

The remaining terms in the numerators are posteriors, which can be approximated
by neural nets, while the terms in the denominators are prior probabilities, which can be
estimated from the frequencies of the classes in the training set.

The posteriors $p(\omega_i \mid \mathbf{x})$ are conditioned on the input feature vector
$\mathbf{x}$ only and can be approximated by a neural network which discriminates between
all the monophones in the system.

The posteriors $p(c_j \mid \omega_i, \mathbf{x})$ are conditioned on the input feature
vector and on one of the monophones $\omega_i$. One way of estimating these probabilities,
which fits neatly into the scheme of a modular neural network system, is to train separate
context expert networks for each of the monophones. The context expert for monophone
$\omega_i$ would be a network which approximates the posteriors $p_i(c_j \mid \mathbf{x})$
for all the context classes of monophone $\omega_i$.
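
The single state factoring can be turned into a small computation in the log domain.
The sketch below is illustrative only: it assumes that the monophone network and the
context expert networks have already produced posteriors for one frame, and the function
name, argument layout and toy numbers are hypothetical.

```python
import math


def scaled_log_likelihood(mono_post, ctx_post, mono_prior, ctx_prior, i, j):
    """log [ p(x | c_j, w_i) / p(x) ] for monophone i and context class j.

    mono_post[i]    -- p(w_i | x), output of the monophone network
    ctx_post[i][j]  -- p(c_j | w_i, x), output of the context expert for w_i
    mono_prior[i]   -- P(w_i), relative frequency in the training set
    ctx_prior[i][j] -- P(c_j | w_i), relative frequency in the training set
    """
    return (math.log(ctx_post[i][j]) - math.log(ctx_prior[i][j])
            + math.log(mono_post[i]) - math.log(mono_prior[i]))


# Toy numbers (made up): 2 monophones with 2 context classes each.
mono_post = [0.7, 0.3]
mono_prior = [0.5, 0.5]
ctx_post = [[0.6, 0.4], [0.5, 0.5]]
ctx_prior = [[0.25, 0.75], [0.5, 0.5]]
value = scaled_log_likelihood(mono_post, ctx_post, mono_prior, ctx_prior, 0, 0)
```

In this toy case the scaled likelihood is (0.6 / 0.25) * (0.7 / 0.5) = 3.36. The frame
probability p(x) is left out, since it is shared by all classes of a frame.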

Fig. 6.1 gives an overview of a context dependent connectionist hybrid system for
single state topologies.


[Diagram: context expert networks for monophones 1 to N, feeding the context dependent
posterior computation]

Figure 6.1: Overview: single state topology hybrid context dependent system


6.2.2  Multi State Topologies

Generally, acoustic models are made up of multiple states in a left-right or Bakis HMM
model, to account for temporal variations in the modeled speech sound. Today's
state-of-the-art recognizers mostly use 3-state and 5-state left-right HMMs. First, consider
a context independent hybrid connectionist HMM system. There are two ways to model
multi-state topologies in such a system. The first is to treat all the states of all
monophone models as one big pool and train a neural network to discriminate between
all of them. This approach requires s*n output nodes for n monophones using s-state
models. Instead, we can adhere to the concept of modularity and factor the posterior
class probability further.

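A multi-state HMM model requires the computation of the state, monophone and
context dependent likelihood $p(\mathbf{x} \mid c_j, \omega_i, s_k)$, where $s_k$ is the
HMM state, $c_j$ is the context class and $\omega_i$ is the monophone. Applying Bayes'
rule and proceeding as in the case of single state models, we obtain:

$$
\begin{aligned}
p(\mathbf{x} \mid c_j, \omega_i, s_k)
&= \frac{p(c_j, \omega_i, s_k \mid \mathbf{x})\, p(\mathbf{x})}{P(c_j, \omega_i, s_k)} \\
&= \frac{p(c_j, \omega_i \mid s_k, \mathbf{x})}{P(c_j, \omega_i \mid s_k)} \cdot
   \frac{p(s_k \mid \mathbf{x})}{P(s_k)} \cdot p(\mathbf{x}) \\
&= \frac{p(c_j \mid \omega_i, s_k, \mathbf{x})}{P(c_j \mid \omega_i, s_k)} \cdot
   \frac{p(\omega_i \mid s_k, \mathbf{x})}{P(\omega_i \mid s_k)} \cdot
   \frac{p(s_k \mid \mathbf{x})}{P(s_k)} \cdot p(\mathbf{x})
\end{aligned}
$$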

All the terms in the denominators are again prior probabilities, which we can estimate
by relative frequencies. The frame probability $p(\mathbf{x})$ can be dropped when seeking
the model with maximum likelihood. It remains to compute the posteriors in the numerators.

Starting from the right side, the posteriors $p(s_k \mid \mathbf{x})$ can be computed by a
single neural network, discriminating between the states in an s-state HMM topology.
Therefore, we call this network a state discriminating network (SDN).

The posteriors $p(\omega_i \mid s_k, \mathbf{x})$ are conditioned on the HMM state and the
input frame and can be computed by a set of s networks, one for each HMM state. These
networks discriminate between the monophones $\omega_i$, given a particular HMM state
$s_k$. The network for state $s_k$ computes $p_k(\omega_i \mid \mathbf{x})$.

The posteriors $p(c_j \mid \omega_i, s_k, \mathbf{x})$ are conditioned on the input frame
$\mathbf{x}$, the HMM state $s_k$ and the monophone $\omega_i$. They can be computed by a
matrix of networks consisting of s times n networks (s is the number of states, n is the
number of monophones). Each of these networks discriminates between all the context
classes of a specific monophone in a specific state. The network for state $s_k$ and
monophone $\omega_i$ therefore computes $p_{ki}(c_j \mid \mathbf{x})$.

Fig. 6.2 gives an overview of a context dependent connectionist hybrid system for multi
state topologies.


Figure 6.2: Overview: multi state topology hybrid context dependent system


The networks depicted in Fig. 6.1 and Fig. 6.2 look like single layer perceptrons, but
they are meant to represent arbitrary posterior probability estimators. Computation of
a specific context dependent likelihood $p(\mathbf{x} \mid c_j, \omega_i, s_k)$ requires
the evaluation of three networks: the state discriminating network (SDN), one of the
monophone expert networks and one of the context expert networks. Note that the
context-dependent hybrid connectionist system can easily be switched back to
context-independent (CI) mode by turning off the context expert networks, a feature not
available in mixture-of-Gaussians based systems.
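
The evaluation of three networks per likelihood, and the switch between CD and CI mode,
can be sketched as follows. This is a minimal sketch under stated assumptions: networks
are represented as plain Python callables that return a list of posteriors, the priors
are stored in a dictionary, and all names (`frame_log_likelihood`, `sdn`, `mono_nets`,
`ctx_nets`, `priors`, `use_context`) are hypothetical.

```python
import math


def frame_log_likelihood(x, k, i, j, sdn, mono_nets, ctx_nets, priors,
                         use_context=True):
    """log [ p(x | c_j, w_i, s_k) / p(x) ] for one frame x.

    Only three networks are evaluated: the SDN, the monophone network of
    state k and the context network of state k and monophone i.
    """
    score = math.log(sdn(x)[k] / priors["state"][k])
    score += math.log(mono_nets[k](x)[i] / priors["mono"][k][i])
    if use_context:  # switching this off yields the context-independent score
        score += math.log(ctx_nets[k][i](x)[j] / priors["ctx"][k][i][j])
    return score


# Toy setup (made up): 2 states, 2 monophones, 2 context classes,
# with constant functions standing in for trained networks.
sdn = lambda x: [0.5, 0.5]
mono_nets = [lambda x: [0.4, 0.6]] * 2
ctx_nets = [[lambda x: [0.3, 0.7]] * 2] * 2
priors = {
    "state": [0.25, 0.75],
    "mono": [[0.1, 0.9], [0.5, 0.5]],
    "ctx": [[[0.6, 0.4], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]],
}
cd = frame_log_likelihood(None, 0, 0, 0, sdn, mono_nets, ctx_nets, priors)
ci = frame_log_likelihood(None, 0, 0, 0, sdn, mono_nets, ctx_nets, priors,
                          use_context=False)
```

For the toy numbers, the CI score corresponds to (0.5 / 0.25) * (0.4 / 0.1) = 8 and the
CD score multiplies this by 0.3 / 0.6, giving 4. Networks for states, monophones and
contexts that the decoder does not hypothesize never need to be evaluated.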


6.2.3  Related  Work 

The modeling of context dependent likelihoods as presented in this thesis most closely
resembles the work in [30] and [31], with the notable difference that we have generalized
context-dependent posteriors to multi-state HMM models.

There are other ways of factoring a conditional posterior probability. For instance,
one could decompose the conditional likelihood for a one-state HMM model as follows:


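$$
p(\mathbf{x} \mid c_j, \omega_i)
= \frac{p(c_j, \omega_i \mid \mathbf{x})\, p(\mathbf{x})}{P(c_j, \omega_i)}
= \frac{p(\omega_i \mid c_j, \mathbf{x})}{P(\omega_i \mid c_j)} \cdot
  \frac{p(c_j \mid \mathbf{x})}{P(c_j)} \cdot p(\mathbf{x})
$$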


In this case, context specific networks are trained to discriminate between the
monophones $\omega_i$, given a specific context class $c_j$. Every context specific network
performs a simpler task than a context-independent network. This approach is adopted by
SRI [13]. However, it is less attractive to us for the following two reasons: (1) one
cannot switch between CI and CD mode, and (2) discriminating between monophones in a
specific context can lead to poor posterior estimates when some monophones occur rarely or
not at all in this context. Furthermore, as we will see in the next chapter, our approach
of factoring posteriors allows us to make use of the same context clustering trees that
are used in mixture-of-Gaussians based HMM systems.

Yet another approach was adopted by Bourlard and Morgan at ICSI [3]. Their method
factors the posterior phone-in-context probability in the same way as we presented it.
However, their system uses only one MLP to estimate context posteriors instead of a set
of context experts as proposed earlier in this thesis. This is possible by giving the
context MLP extra binary inputs which encode the current monophone. This approach has
the disadvantage of requiring multiple forward passes through the context MLP during
recognition, since the decoder will hypothesize more than one monophone at each time
step, which leads to different network input patterns.


6.3  Polyphone  Clustered  Contexts 

We have presented an architecture for estimating context dependent posterior monophone
probabilities, given a set of context classes. We have not yet discussed how we obtain
those contextual classes. The remainder of this chapter will present polyphone clustering
using decision trees, as it is used within the mixture-of-Gaussians based JANUS
recognizer. We will show that the resulting context clustering trees can also be used to
derive phonetic context classes for the context expert networks in our hybrid framework.


6.3.1  Polyphones 

Polyphones are generalizations of the well-established triphones. They model a broader
context of a given monophone. For instance, the word 'BABYSITTING' is modeled,
according to our dictionary, as the following sequence of monophones:

B - EY - B - IY - S - IH - DX - IX - NG

If we were to model, for instance, polyphonic contexts of a maximum of ±2 phones, the
above word would be modeled as a sequence of the following polyphones:


Monophone   Polyphone
---------   ----------------------
B           * - * - B - EY - B
EY          * - B - EY - B - IY
B           B - EY - B - IY - S
IY          EY - B - IY - S - IH
S           B - IY - S - IH - DX
IH          IY - S - IH - DX - IX
DX          S - IH - DX - IX - NG
IX          IH - DX - IX - NG - *
NG          DX - IX - NG - * - *
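
The expansion of a phone sequence into polyphones can be sketched in a few lines. This
is an illustration only: the function name `polyphones` is hypothetical, and padding
both word boundaries with `width` copies of the `*` symbol is an assumption about how
missing neighbors are represented.

```python
def polyphones(phones, width=2, pad="*"):
    """Return one polyphone (context of +/- width phones) per monophone."""
    padded = [pad] * width + list(phones) + [pad] * width
    window = 2 * width + 1
    return [" - ".join(padded[i:i + window]) for i in range(len(phones))]


# Phone sequence of the word BABYSITTING as given in the text.
phones = ["B", "EY", "B", "IY", "S", "IH", "DX", "IX", "NG"]
result = polyphones(phones)
```

Each entry of `result` is centered on the monophone at the same position in `phones`,
so the first entry is the polyphone of the initial B and the last one that of NG.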

An inventory of polyphones can be extracted from large text corpora and stored
efficiently in a set of binary decision trees, one for each monophone. It should be obvious
that the number of polyphones observed in a given large text corpus is far too high to
allow separate models for each one of them. In fact, many of the observed polyphones
occur only once in the training set. Additionally, there may be some polyphones in an
unseen test corpus which were not present in the training corpus, no matter how big the
latter was.

Therefore, we need to apply a clustering procedure which reduces the number of
distinct models while providing full coverage of unseen new test data. By far the most
popular technique is to use decision trees with questions about the phonetic context.
Decision trees are very appealing because they are guaranteed to cover all phones in any
context, while using a distance measure based on the acoustic data to split nodes and grow
the tree.

6.3.2  Decision  Tree  Clustering 

Decision trees are divisive clustering methods that make use of binary trees asking
questions at each internal node. Associated with a decision tree is a finite set of
questions which can be answered with yes or no. The children nodes of each internal node
correspond to the two possible answers to the particular question asked. Starting with a
tree containing only the root node, successive splits are applied to grow the tree to a
desired size.

The iterative tree growing procedure works as follows: Initially, all the acoustic
training data is associated with the root node. In each growing step, a preliminary split
is computed for all of the leaf nodes and all the possible questions that can be asked.
Each of these preliminary splits is scored using a distance measure which models the
goodness of the split. The leaf node with the best score is then split, while all the other
preliminary splits are discarded. The training data associated with the node being split
is distributed among the children nodes according to the answers to the question
being used. The distance measure used to score the preliminary node splits depends
strongly on the representation of the data. In [30], [31], unimodal multivariate Gaussians
with diagonal covariance matrices are used to model the data in each leaf node.
They use the gain in log-likelihood due to the data being split as the distance measure.
This involves the estimation of diagonal covariance matrices for each hypothesized node
split:


$$
\Delta L = n \log|\Sigma| - \left( n_l \log|\Sigma_l| + n_r \log|\Sigma_r| \right)
$$

where n is the number of samples associated with the parent node, $n_l$ and $n_r$ are
the numbers of samples associated with the left and right children nodes, respectively,
$\Sigma$ is the diagonal covariance matrix of the data in the parent node and $\Sigma_l$
and $\Sigma_r$ are the diagonal covariance matrices of the data in the children nodes,
respectively.

Once a decision tree for a particular monophone is grown to a desired size, its leaves
represent the context classes of that monophone and are labeled accordingly.
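
The log-likelihood based score of a hypothesized split can be sketched as follows. This
is a minimal sketch, assuming maximum likelihood variance estimates per dimension, so
that the log determinant of a diagonal covariance matrix is the sum of the log variances.
The function names and the variance floor `var_floor` are hypothetical additions, the
latter only guarding against a zero variance.

```python
import math


def log_det_diag_cov(samples, var_floor=1e-6):
    """Log determinant of the diagonal ML covariance matrix of the samples."""
    n, dim = len(samples), len(samples[0])
    total = 0.0
    for d in range(dim):
        mean = sum(s[d] for s in samples) / n
        var = sum((s[d] - mean) ** 2 for s in samples) / n
        total += math.log(max(var, var_floor))
    return total


def split_gain(left, right):
    """n log|S| - (n_l log|S_l| + n_r log|S_r|) for a hypothesized split."""
    parent = left + right
    return (len(parent) * log_det_diag_cov(parent)
            - (len(left) * log_det_diag_cov(left)
               + len(right) * log_det_diag_cov(right)))


# Toy data (made up): two well separated groups of 2-dimensional samples.
left = [[0.0, 0.0], [1.0, 1.0]]
right = [[10.0, 0.0], [11.0, 1.0]]
gain = split_gain(left, right)
```

The sketch makes the cost visible: every hypothesized split requires a pass over all
samples of the nodes involved in order to re-estimate the variances.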


6.3.3  Entropy  based  Clustering 

The distance measure used in [30], [31] requires the estimation of covariance matrices for
each hypothesized node split using all the acoustic data associated with the nodes involved
in the split. This can be very expensive, especially when the training dataset and the set
of questions are large.

Phonetic context decision trees in JANUS are grown using a distance measure that
does not depend on the acoustic training data directly. Instead, the mixture coefficients
of the context independent Gaussian mixtures are interpreted as discrete distributions
over a vector quantized feature space, represented by the codebooks of Gaussians. When
hypothesizing a new split, discrete distributions over the same monophone codebook are
computed for the two hypothesized children nodes. To score the goodness of the split,
the gain in entropy using separate distributions for the children nodes is computed:


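$$
\begin{aligned}
d(p, p_l, p_r) &= n_l H_l(p_l) + n_r H_r(p_r) - n H(p) \\
\text{with} \quad H_l(p_l) &= -\sum_i p_{li} \log p_{li} \\
H_r(p_r) &= -\sum_i p_{ri} \log p_{ri} \\
H(p) &= -\sum_i p_i \log p_i
\end{aligned}
$$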


#  context  models 
In the above equation, the sums go over the number of coefficients in the discrete
probability distributions. Using the above distance function to score splits in a decision
tree is efficient and appealing from an information theoretic point of view, since the
above splitting score can be interpreted as the mutual information between the
distributions of the children nodes.
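A minimal Python sketch of this splitting score follows. It assumes that each node is summarized by accumulated counts per codebook entry (so that the discrete distributions are the normalized counts and the parent is the sum of its children); the function names are hypothetical. With the sign convention of the equation above and a parent that is the count-weighted average of its children, the value is never positive, so it is its magnitude that ranks candidate splits.

```python
import math

def entropy(p):
    """Entropy of a discrete distribution; zero coefficients contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]

def split_score(counts_left, counts_right):
    """n_l H(p_l) + n_r H(p_r) - n H(p), computed from codebook counts."""
    n_l, n_r = sum(counts_left), sum(counts_right)
    parent = [a + b for a, b in zip(counts_left, counts_right)]
    return (n_l * entropy(normalize(counts_left))
            + n_r * entropy(normalize(counts_right))
            - (n_l + n_r) * entropy(normalize(parent)))
```

No acoustic data is touched here: scoring a question only requires summing and normalizing vectors whose length is the codebook size.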

6.3.4  Analyzing  Cluster  Trees 

To show properties of the splitting criterion, we created cluster trees for the ESST speech
task with 5 different numbers of overall context models: 500, 1000, 1500, 2000 and 2500.
The ESST speech task is an English spontaneous speech database which we also use for
the evaluation of the hybrid speech recognizer (see Chapter 8 for details).

For each of the 5 cluster trees, we computed the number of context models generated
for each monophone (over all states of a 3-state left-right HMM model). Fig. 6.3 shows
the evolution of the number of context models over the 5 cluster phases and the 52
monophones in our system.


Figure 6.3: Distribution of context models (horizontal axis: monophone)

It is remarkable that the trees for the monophones N, T and IY together contain about
20% of all cluster models (2500) over all trees. Fig. 6.4 shows a typical decision tree. It
was built for the middle state of a three-state model of the monophone AX. It is part of
a forest of 156 decision trees (52 monophones times 3 states) with an overall number of
1000 context models. The polyphonic context is restricted to the 3 phones left and right
of a midphone. The set of all possible questions that were available for the generation of
the tree is listed in Appendix A.


Figure  6.4:  Decision  tree  for  monophone  AX-m 

Obviously, the clustering process favours questions about the immediate right or left
neighboring phone. This is consistent with our intuition that the influence of context
decreases with increasing neighborhood distance. Nevertheless, the tree in Fig. 6.4 also
uses questions about the broader context. It even asks a question about a phone that lies
3 frames in the future, although such questions generally occur only in the lower parts
of the trees. This means that it is in fact helpful to consider broader contexts than
just triphones. In the beginning of node splitting, the tree concentrates on neighboring
contexts, but when the trees get bigger, the splitting process starts to use broader context
questions as well.


Chapter  7 

Mixtures  of  Gaussian  Experts 


Until now, we have assumed a generalized linear model in both gates and experts of a
Hierarchical Mixture of Experts, although the architecture in principle allows arbitrary
parametric forms of gates and experts. In the case of classification, however, the models
for gates and experts have to fulfill the constraint that their output activations sum up
to one for each input frame. Recently, Xu, Jordan and Hinton [57] have proposed to use
a parametric form based on Gaussian kernels for the gates. We will further develop their
work, showing that the same parametric form can be used for experts as well. Such an
architecture is very attractive because it can be initialized to a near optimal solution very
efficiently, thus reducing convergence time of the learning algorithm.


7.1  Alternative  Parameterization 

Instead of applying a generalized linear model with softmax nonlinearity, the following
parameterization was proposed for the gate in a one-level mixture of experts architecture
([57]):

$$g_i(x, v) = \frac{\alpha_i P(x|v_i)}{\sum_k \alpha_k P(x|v_k)} \quad \text{with} \quad \sum_k \alpha_k = 1 \ \text{and} \ \alpha_k > 0$$

$$P(x|v_i) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}$$

This form of a gate is legal, since the $g_i$'s by definition sum up to one, thus providing
a partition of unity for each input feature vector $x$. The above parametric form can be
interpreted as a parametric a-posteriori classifier according to Bayes' theorem:

$$P(\omega_i|x) = \frac{P(\omega_i)\, p(x|\omega_i)}{\sum_k P(\omega_k)\, p(x|\omega_k)}$$

where the prior probabilities $P(\omega_i)$ are the $\alpha_i$'s and the class likelihoods
$p(x|\omega_i)$ are modeled by single Gaussian distributions.
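As an illustration, the gate activations can be evaluated as follows. This is a sketch rather than the implementation used in this work: it assumes diagonal covariance matrices (passed as lists of variances), the function names are hypothetical, and the normalization is carried out in the log domain, which is a common precaution against underflow in high-dimensional feature spaces.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gate_activations(x, alphas, means, variances):
    """g_i(x) = alpha_i P(x|v_i) / sum_k alpha_k P(x|v_k)."""
    logs = [math.log(a) + log_gauss_diag(x, m, v)
            for a, m, v in zip(alphas, means, variances)]
    top = max(logs)  # subtract the maximum before exponentiating
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]
```

The returned values are non-negative and sum to one for every input, which is the partition-of-unity property required of a gate.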


7.2  Gaussian  Classifier  as  Gate 

Parameterizing the gate of a mixture of experts as a Gaussian a-posteriori classifier allows
us to derive an efficient single-loop EM algorithm to estimate the parameters of the gate.
Additionally, the special parametric form allows us to initialize the Gaussian kernels and
a-priori probabilities, which reduces training time significantly.

7.2.1  EM  algorithm 

The conditional mixture underlying a mixture of experts is

$$P(y|x,\Theta) = \sum_i g_i(x,v)\, P_i(y|x,\Theta_i)$$

If we attempt to derive an EM algorithm directly on this mixture density, we find that
the M-step is not analytically solvable and would require iterative processing, similar to
the IRLS algorithm. However, the above conditional mixture can be rewritten in a form
that allows an analytical solution for the ML problem:

$$P(y,x) = P(y|x,\Theta)\, P(x,v) = \sum_i \alpha_i P(x|v_i)\, P_i(y|x,\Theta_i)$$

where $P(x,v) = \sum_k \alpha_k P(x|v_k)$ is the denominator of the gate.

Instead of estimating the gating parameters to maximize the likelihood of the original
mixture density, we can maximize the likelihood of the above joint density. Applying the
EM algorithm in a similar way as we did in the case of generalized linear models leads to
the following iterative estimation method:


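(1) E-step For each training vector, compute the posterior node probabilities $h_i$
according to

$$h_i^{(p)}(y^{(t)}|x^{(t)}) = \frac{\alpha_i^{(p)} P(x^{(t)}|v_i^{(p)})\, P_i(y^{(t)}|x^{(t)},\Theta_i^{(p)})}{\sum_k \alpha_k^{(p)} P(x^{(t)}|v_k^{(p)})\, P_k(y^{(t)}|x^{(t)},\Theta_k^{(p)})}$$

(2) M-step Use the $h_i$'s to compute new estimates for the parameters $\alpha_i$, $\mu_i$
and $\Sigma_i$ of the gate. The new estimates can be computed directly, since the ML
problem is now analytically solvable:

$$\alpha_i^{(p+1)} = \frac{1}{N} \sum_t h_i^{(p)}(y^{(t)}|x^{(t)})$$

$$\mu_i^{(p+1)} = \frac{\sum_t h_i^{(p)}(y^{(t)}|x^{(t)})\, x^{(t)}}{\sum_t h_i^{(p)}(y^{(t)}|x^{(t)})}$$

$$\Sigma_i^{(p+1)} = \frac{\sum_t h_i^{(p)}(y^{(t)}|x^{(t)}) \left[ x^{(t)} - \mu_i^{(p+1)} \right] \left[ x^{(t)} - \mu_i^{(p+1)} \right]^T}{\sum_t h_i^{(p)}(y^{(t)}|x^{(t)})}$$

where $N$ is the number of training vectors.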

The ML problem for the experts remains analytically unsolvable (in the case of
classification) and these parameters must be estimated either iteratively by gradient ascent
or by the least squares heuristic. However, the above EM algorithm for gates is
computationally more efficient than the IRLS algorithm for GLIMs. Note that the computation
of node posteriors $h_i$ has changed compared to the EM algorithm for GLIMs. This
indirectly influences the estimation of expert parameters as well, since the joint node
posteriors appear in the re-estimation formulas for experts.
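One iteration of the single-loop EM algorithm for the gate might look as follows in Python. The sketch makes several assumptions that are not taken from the text: diagonal covariances, expert likelihoods $P_i(y^{(t)}|x^{(t)})$ supplied as precomputed numbers, a variance floor, and hypothetical function and argument names. It also evaluates densities directly rather than in the log domain, which is only adequate for low-dimensional toy data.

```python
import math

def gauss_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance."""
    return math.exp(-0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                               for xi, m, v in zip(x, mean, var)))

def em_step(xs, expert_liks, alphas, means, variances, var_floor=1e-6):
    """One EM iteration for the gate parameters.

    xs[t]             feature vector of training sample t
    expert_liks[t][i] likelihood of the target of sample t under expert i
    """
    n_experts, dim = len(alphas), len(xs[0])
    # E-step: h[t][i] is proportional to alpha_i P(x|v_i) P_i(y|x)
    posteriors = []
    for x, liks in zip(xs, expert_liks):
        joint = [alphas[i] * gauss_diag(x, means[i], variances[i]) * liks[i]
                 for i in range(n_experts)]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    # M-step: posterior-weighted relative frequencies, means and variances
    new_alphas, new_means, new_vars = [], [], []
    for i in range(n_experts):
        mass = sum(h[i] for h in posteriors)
        mean = [sum(h[i] * x[k] for h, x in zip(posteriors, xs)) / mass
                for k in range(dim)]
        var = [max(sum(h[i] * (x[k] - mean[k]) ** 2
                       for h, x in zip(posteriors, xs)) / mass, var_floor)
               for k in range(dim)]
        new_alphas.append(mass / len(xs))
        new_means.append(mean)
        new_vars.append(var)
    return new_alphas, new_means, new_vars
```

Each call makes exactly one pass over the data and contains no inner iteration, which is what distinguishes this procedure from IRLS.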

Note also that the above formulation of EM learning maximizes the sum of
the mixture likelihood and the conditional likelihood of the gate instead of maximizing
the mixture likelihood itself. During testing, however, the output of the mixture still
follows the mixture model of HME's.

7.2.2  Initialization 

The parametric form which we have applied to the gate is very attractive because it allows
the initialization of parameters to near optimal values. There is a significant body of work
on the initialization of Gaussian mixture models and radial basis function networks which
can be adopted here as well. In fact, since we already know that the parametric form
can be viewed as a Gaussian a-posteriori classifier, its parameters can best be initialized
by estimating priors and class likelihoods by relative frequencies and maximum likelihood
estimation, respectively. However, in the case of a gate in a mixture of experts, we do
not have class labels to estimate the parameters of a Gaussian classifier the way we just
proposed (nevertheless, this technique will gain importance later, when we use Gaussian
classifiers as experts as well).

One possible initialization technique for Gaussian gates that works very well in practice
is to estimate the parameters such that the likelihood of the data under an unsupervised
mixture model is maximized. That means, we initialize the parameters of the gate
according to

$$(\alpha_i, v_i) = \arg\max_{\alpha, v} \sum_t \log \sum_i \alpha_i P(x^{(t)}|v_i)$$

with $\sum_k \alpha_k = 1$ and $\alpha_k > 0$. Usually, maximizing such a likelihood is
done in the following three steps:

(1)  Extract  Samples  Initialize  the  means  of  the  Gaussians  by  extracting  the  appro¬ 
priate  number  of  samples  randomly  from  the  training  set. 

(2) Cluster Means Apply a clustering algorithm such as the k-means or LBG algorithm
to the means. This corresponds to minimizing the distortion of a discrete
vector-quantized distribution where the codebook vectors are the means.

(3) Maximum Likelihood Iteratively reestimate the mixture coefficients $\alpha_i$, the
means $\mu_i$ and the covariance matrices $\Sigma_i$ according to the EM algorithm for
Gaussian mixtures [10].
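Steps (1) and (2) can be sketched with a plain k-means loop. The code below is an illustration only: the function name, the fixed number of iterations and the seeded random generator are assumptions, and the mixture coefficients it returns are simple relative cluster frequencies meant as a starting point for step (3), which is not shown.

```python
import random

def kmeans_init(xs, k, iterations=10, seed=0):
    """Seed k means with random samples, then refine them by k-means (iterations >= 1)."""
    rng = random.Random(seed)
    means = [list(x) for x in rng.sample(xs, k)]  # step (1): extract samples
    dim = len(xs[0])
    for _ in range(iterations):  # step (2): cluster means
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(x, means[i])))
            clusters[nearest].append(x)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous mean
                means[i] = [sum(x[d] for x in members) / len(members)
                            for d in range(dim)]
    # relative cluster sizes serve as initial mixture coefficients
    alphas = [len(members) / len(xs) for members in clusters]
    return alphas, means, clusters
```

Initial variances can be estimated from the members of each returned cluster before the EM iterations of step (3) take over.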

The possibility to initialize the gate parameters to near optimal solutions and the
single-loop EM re-estimation algorithm render the Gaussian parameterization a powerful
extension to the standard HME architecture.

7.2.3  Combining  Multiple  Classifiers 

There is one other application of Gaussian gates, namely the task of combining multiple
classifiers (CMC). Suppose we have n different kinds of pre-trained classifiers, all trained
on the same data set. Since each of the classifiers might have learned different parts
of the data best, it is generally a good idea to combine their estimates, if we have a
combination method capable of supporting the good and suppressing the bad classifiers
for each training sample.

The problem can be treated as a special case of a mixture of experts, where the expert
parameters remain fixed and only the gates are iteratively adapted. The single-loop EM
algorithm can therefore be directly used to estimate the gate parameters. It was shown
in [57] that this can increase overall performance considerably, while avoiding the costly
re-estimation of the expert classifiers. This makes the technique even more attractive for
our purpose in speech recognition, since we have to deal with large datasets consisting of
millions of feature vectors.


7.3  Mixture  of  Gaussian  Experts 

Given the advantages of the Gaussian parameterization of the gate, it would be desirable
to use the same parameterization for the experts as well. Also, we would like to
generalize the technique to hierarchical mixtures with more than one gate. Unfortunately,
the solution to the EM learning problem proposed in [57] does not generalize to experts.
We will therefore relax the EM constraint and derive a generalized EM algorithm that
only guarantees to increase the mixture likelihood in each iteration, instead of maximizing
it.

7.3.1  Gaussian  Classifiers  as  Experts 

The parametric form based on Gaussian kernels is even more attractive for experts than
it is for gates. The reason is that in the case of experts, we have class labels available
for the initialization. This simplifies the initialization of expert parameters, since each
Gaussian kernel can be estimated independently on a subset of the data. Given that
the gate is already initialized, the initialization of the experts requires just a single pass
through the training data, yet yields parameter estimates which give the mixture an
initial performance that is close to the optimal one, even before applying any kind of
training algorithm to the whole architecture.

7.3.2  GEM  algorithm 

As promised, we will now derive a generalized EM algorithm for a mixture of experts
which uses Gaussian parameterizations exclusively. The probability model of the overall
architecture is

$$P(y|x,\Theta) = \sum_i g_i(x,v)\, P_i(y|x,\Theta_i)$$

where the $P_i$ are multinomial densities, modeling the multiway classification task
imposed on the experts, and the $v_i$ and $\Theta_i$ are the sets of parameters for gate
and experts, respectively. The expert activations are computed the same way as the gate
activations, assuming a Gaussian a-posteriori classifier:

$$y_{ij}(x,\Theta_i) = \frac{\alpha_{ij} P(x|\Theta_{ij})}{\sum_k \alpha_{ik} P(x|\Theta_{ik})} \quad \text{with} \quad \sum_k \alpha_{ik} = 1 \ \text{and} \ \alpha_{ik} > 0$$

$$P(x|\Theta_{ij}) = \frac{1}{(2\pi)^{n/2} |\Sigma_{ij}|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_{ij})^T \Sigma_{ij}^{-1} (x - \mu_{ij}) \right\}$$

The expert activations can be re-written in an interesting form:

$$y_{ij} = \frac{\exp(z_{ij})}{\sum_k \exp(z_{ik})}$$

$$\text{with} \quad z_{ij} = \log(\alpha_{ij}) - \tfrac{1}{2} \left[ n \log(2\pi) + \log|\Sigma_{ij}| + (x - \mu_{ij})^T \Sigma_{ij}^{-1} (x - \mu_{ij}) \right]$$

We have expressed the expert activations using the same 'softmax' nonlinearity as in
the GLIM case. The difference is that we changed the underlying linear model, which
computes $z = Wx$, to a radial model, which basically computes $z = (x - w)^2$. Expressing
the new model in terms of the softmax function allows us to unify linear and radial expert
models.
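The unification of linear and radial models can be made concrete with a short Python sketch. Both kinds of expert share the softmax function and differ only in how the pre-activations $z$ are computed. Diagonal covariances and all function names are assumptions made for this illustration.

```python
import math

def radial_z(x, alpha, mean, var):
    """z = log(alpha) - 1/2 [n log(2 pi) + log|S| + (x-mu)^T S^-1 (x-mu)], diagonal S."""
    n = len(x)
    log_det = sum(math.log(v) for v in var)
    mahalanobis = sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var))
    return math.log(alpha) - 0.5 * (n * math.log(2.0 * math.pi) + log_det + mahalanobis)

def linear_z(x, weights):
    """z = w^T x, the pre-activation of one output of a generalized linear model."""
    return sum(w * xi for w, xi in zip(weights, x))

def softmax(zs):
    """Shared output nonlinearity; the maximum is subtracted for stability."""
    top = max(zs)
    exps = [math.exp(z - top) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]
```

Applying `softmax` to the values of `radial_z` reproduces the Gaussian a-posteriori probabilities, while applying it to the values of `linear_z` gives the GLIM activations.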

The M-step of the EM algorithm for mixtures of experts involves the maximization of
the following two likelihoods (assuming a multinomial probability model)

$$v^{(p+1)} = \arg\max_v \sum_t \sum_i h_i^{(t)} \log g_i^{(t)}$$

$$\Theta_i^{(p+1)} = \arg\max_{\Theta_i} \sum_t h_i^{(t)} \sum_j d_j^{(t)} \log y_{ij}^{(t)}$$

where the $d_j^{(t)}$ are targets for the expert output nodes. Because of the nonlinearity of
the softmax function in both $g$ and $y$, there is no closed-form solution to this problem.
Therefore, we derive a GEM algorithm which increases the likelihoods using gradient
ascent, where $\delta_{ij}$ is the Kronecker symbol, $\eta$ is the learning rate, and the
$z_i$ are the linear or radial functions prior to the softmax nonlinearity.
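For reference, the gradient that a softmax-based update of this kind relies on can be written out as follows. This is the standard softmax derivative applied to the gate objective $Q = \sum_t \sum_i h_i^{(t)} \log g_i^{(t)}$ with $g^{(t)} = \mathrm{softmax}(z^{(t)})$; it is a sketch under that assumption and not necessarily the exact form of the update used in this work. The symbol $\theta$ stands for any parameter on which the $z_j$ depend.

```latex
\frac{\partial g_i}{\partial z_j} = g_i \left( \delta_{ij} - g_j \right)
\qquad\Longrightarrow\qquad
\frac{\partial Q}{\partial z_j^{(t)}}
  = \sum_i h_i^{(t)} \left( \delta_{ij} - g_j^{(t)} \right)
  = h_j^{(t)} - g_j^{(t)}
  \quad \text{since } \sum_i h_i^{(t)} = 1

\Delta \theta = \eta \sum_t \sum_j \left( h_j^{(t)} - g_j^{(t)} \right)
  \frac{\partial z_j^{(t)}}{\partial \theta}
```

The update moves the gate activations toward the posteriors and vanishes when the two agree on every training sample.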

In the case of Gaussian experts with diagonal covariance matrices, we obtain the
following update rules for the parameters of a specific expert $E_i$:
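A derivation sketch of such rules, under the assumption that the expert objective is $Q_i = \sum_t h_i^{(t)} \sum_j d_j^{(t)} \log y_{ij}^{(t)}$ with targets that sum to one, and that $\sigma_{ijk}^2$ denotes the $k$-th diagonal element of $\Sigma_{ij}$ (a notation introduced here for illustration). The expressions follow from the chain rule through the softmax and the radial function $z_{ij}$; they may differ in form from the rules originally given.

```latex
\frac{\partial Q_i}{\partial z_{ij}^{(t)}} = h_i^{(t)} \left( d_j^{(t)} - y_{ij}^{(t)} \right)

\Delta \alpha_{ij} = \eta \sum_t h_i^{(t)} \left( d_j^{(t)} - y_{ij}^{(t)} \right) \frac{1}{\alpha_{ij}}

\Delta \mu_{ijk} = \eta \sum_t h_i^{(t)} \left( d_j^{(t)} - y_{ij}^{(t)} \right)
  \frac{x_k^{(t)} - \mu_{ijk}}{\sigma_{ijk}^2}

\Delta \sigma_{ijk}^2 = \eta \sum_t h_i^{(t)} \left( d_j^{(t)} - y_{ij}^{(t)} \right)
  \frac{1}{2} \left[ \frac{(x_k^{(t)} - \mu_{ijk})^2}{\sigma_{ijk}^4} - \frac{1}{\sigma_{ijk}^2} \right]
```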
The $\alpha_{ij}$'s need to be normalized after each iteration in order to fulfill the
constraint that they sum to one. To speed up convergence, it is possible to use this algorithm
in a stochastic gradient based version, updating the parameters each time M training
samples have been presented. The presented GEM algorithm is basically a first-order
technique. The reader may therefore argue that convergence might be too slow
to render this algorithm useful. However, we will show that the combination of this
algorithm with the initialization technique presented above yields very fast convergence in
practice.

7.3.3  BBI  Trees  for  Pruning 

In [16], we presented a binary tree based space partitioning algorithm which is very effective
in speeding up the evaluation of Gaussian mixtures with diagonal covariance matrices.
This algorithm partitions the feature space into a set of $2^d$ so-called buckets, where $d$
is the depth of the tree, by means of hyperplanes orthogonal to one of the coordinate axes.
Given a particular feature vector $x$, the algorithm is able to determine the bucket in which
the vector resides with just a few scalar comparisons. Having determined the correct bucket,
a reduced list of Gaussians, which is computed in advance, is evaluated instead of the whole
mixture.

This algorithm can easily be applied to speed up a Gaussian classifier based hierarchical
mixture of experts, if the diagonal covariance assumption holds. First, we compute a
BBI space partitioning tree for each of the Gaussian classifiers (each node in the MGE
tree). During training or testing, when the MGE nodes are asked to compute posterior
probabilities, the BBI trees are used to determine a reduced set of Gaussians which
contribute more than a specific threshold. Only these Gaussians are then evaluated; all
the remaining ones are pruned to an activation of 0.0. This technique can be seen as
a form of MGE tree pruning, if applied to gating nodes, where each Gaussian in the
gate classifier corresponds to one of the children nodes. We found that BBI trees for
MGE pruning are particularly useful for MGE topologies with a high branching factor.
The overhead of pre-computing BBI trees for each MGE node is negligible during the
training of MGE's. For testing, the BBI trees only have to be computed once and can be
stored together with the remaining MGE tree parameters.
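The bucket lookup can be illustrated with a strongly simplified Python sketch. It is not the algorithm of [16]: here the box of a Gaussian is assumed to extend a fixed number of standard deviations around its mean instead of being derived from an activation threshold, and each cut is simply placed at the midpoint of the widest axis of the current region. All names are hypothetical.

```python
def gaussian_box(mean, var, radius=3.0):
    """Axis-parallel box around a diagonal Gaussian (radius is an assumption)."""
    return [(m - radius * v ** 0.5, m + radius * v ** 0.5) for m, v in zip(mean, var)]

def build_tree(boxes, indices, region, depth):
    """Halve `region` recursively; leaves are lists of intersecting Gaussians."""
    if depth == 0:
        return indices
    axis = max(range(len(region)), key=lambda k: region[k][1] - region[k][0])
    lo, hi = region[axis]
    cut = 0.5 * (lo + hi)
    left_region = region[:axis] + [(lo, cut)] + region[axis + 1:]
    right_region = region[:axis] + [(cut, hi)] + region[axis + 1:]
    # a Gaussian is kept on a side if its box reaches across the cut to that side
    left = [i for i in indices if boxes[i][axis][0] <= cut]
    right = [i for i in indices if boxes[i][axis][1] >= cut]
    return (axis, cut,
            build_tree(boxes, left, left_region, depth - 1),
            build_tree(boxes, right, right_region, depth - 1))

def lookup(tree, x):
    """Descend with one scalar comparison per level; return the bucket's Gaussian list."""
    while isinstance(tree, tuple):
        axis, cut, left, right = tree
        tree = left if x[axis] < cut else right
    return tree
```

A lookup costs one comparison per tree level, after which only the Gaussians listed in the bucket are evaluated and all others receive an activation of 0.0.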


7.4  Experiments 

We trained a GLIM- and a Gauss-classifier based mixture of experts on the Peterson &
Barney vowel data to compare the two parameterizations. The architecture was the same
in both cases, a 1-level tree featuring 1 gate and 10 experts. We chose the branching
factor of the tree to be the number of output classes, because this allows an even faster
initialization scheme for the MGE than presented so far. Initialization for the MGE
proceeds in two steps (requiring two iterations through the training data):

(1) Estimate parameters of a single Gaussian expert. Expand the tree to a 1-level, 10
children architecture, switching the Gaussian expert to a Gaussian gate, and freeze
its parameters.

(2) Estimate parameters of the 10 new experts, using the gate activations as observation
weights.

After the initialization, we train the architecture using the GEM algorithm presented
earlier. Fig. 7.1 shows the log-likelihood on the training set for an MGE and an HME.
Fig. 7.2 shows the mean square error on the test set for the same training run.


Figure 7.1: Evolution of log-likelihood for HME and MGE during training (panel title:
Log-likelihood for 1-10 GLIM- and Gauss-HME's)

Figure 7.2: Evolution of MSE for HME and MGE during training (horizontal axis: epochs)

The first two iterations for the MGE consist of initializing the parameters. The
performance of the MGE after initialization is already very high, yet the following GEM
training can improve performance further. Note that the initialization phase for the MGE
takes considerably less time than a regular GEM or EM iteration, where we have to compute
node and branching posteriors. Taking this into account, the MGE compares favourably
to a same-size HME.


Chapter  8 
Evaluation 


8.1  Hybrid  Janus 

This section briefly introduces the hybrid HME/HMM speech recognition system that was
developed during this thesis. As a starting point of this work, there was a fully functional
continuous-density HMM speech recognizer available: JANUS-SR version 3. This system
integrates the basic recognizer modules, such as feature extraction, acoustic modeling,
language modeling and the decoder. The goal of this thesis was to implement a completely
new acoustic scoring module based on HME's for JANUS, which can be used stand-alone
or in combination with the existing mixture-of-Gaussians scoring module. Version 3 of
the JANUS recognizer was constructed as a speech recognition toolbox, exporting all the
relevant data structures and methods in an object oriented fashion, using the Tcl/Tk
toolkit as the user front-end.

8.1.1  General  Concept 

The JANUS recognizer implements acoustic scoring by a generic object, called 'stream'.
A system can contain one or more such streams. Each stream can be trained and
asked for estimates of model likelihoods. One important concept in JANUS is that
the streams are responsible for the modeling of basic acoustic units. All other modules
interface with the streams by tagged sequences of phones. This allows the use of different
context models by different streams and facilitates the integration of a connectionist score
computer. For instance, a tied-state continuous density mixture-of-Gaussians scoring with
typically about 5000 context models can easily be combined with a context-independent
connectionist a-posteriori scoring.

The hybrid system developed for this thesis allows context-independent and
context-dependent connectionist (HME) scoring of multi-state HMM's, using decision trees to
cluster models. Fig. 8.1 gives an overview of the connectionist part of the hybrid JANUS
system.

Figure 8.1: Overview: Modules of hybrid JANUS recognition system


The HmeStream object realizes model clustering, score computation and training by
referring to the HmeSet object. The HmeSet object contains a set of Hme objects for
context-independent and context-dependent modeling. The HmeSet object also manages the
distribution of training and testing frames to the required Hme objects. An Hme object
realizes an arbitrary hierarchical mixtures of experts tree (arbitrary topology). It
contains gate and expert nodes, which in turn contain Classifier objects. Right now, 3 types
of Classifier objects are available in JANUS: standard GLIM's as proposed for HME's
by Jordan & Jacobs, Gauss classifiers necessary to build Mixtures of Gaussian experts
(MGE) and two-layer perceptrons (MLP). The concept of allowing arbitrary classifiers as
HME nodes generalizes the original idea of HME's, which was entirely based on GLIM's.
More classifier types can easily be added to JANUS, giving a great deal of flexibility to
HME objects. Also, non-modular approaches like ICSI's single MLP hybrid system can
be modeled by single node HME's. Apart from being used as HME nodes, all the classifier
types export their functionality through the user interface, which allows them to be used for
other speech-related or even non-speech-related purposes as well.

When computing scores or updating parameters, the HmeStream refers to an HmeTree
object to cluster phonetic contexts to model names. In the context-independent case,
this decision tree degenerates to a decision list. Once phonetic contexts are resolved
to model names, the HmeStream hands them down to the HmeSet object, which refers
to an HmeMapList object to map model names to the appropriate HME and output node
identifiers.


8.2  Task  Description 

To  evaluate  the  system,  we  use  the  English  Spontaneous  Scheduling  Task  (ESST),  a  2500 
word  spontaneous  speech  database  in  the  domain  of  meeting  negotiation.  The  database 
consists  of  roughly  8000  utterances  (26  hours  of  speech),  recorded  at  a  sampling  rate  of 
16 kHz. Typical examples of utterances are

I  I  MEANT  MAY  TWENTY  SIXTH  ARE  YOU  AVAILABLE  MAY  TWENTY 
SIXTH  BECAUSE  MAY  THIRTY  FIRST  TO  JUNE  THE  SECOND  I’LL 
BE  OUT  OF  TOWN 

OKAY  WE  NEED  TO  SCHEDULE  ANOTHER  MEETING  MY  WEEK  ISN’T 
LOOKING  THIS  WEEK  ISN’T  LOOKING  TOO  BAD  MONDAY  I’M  FREE 
IN  THE  AFTERNOON  AND  TUESDAY  I’M  FREE  IN  THE  MORNING  SO 
I  GUESS  WE’LL  START  WITH  THAT  AND  I’LL  SEE  HOW  YOUR 
SCHEDULE  IS 

The database features many spontaneous effects, such as false starts, stuttering and
incomplete sentences. It contains roughly equal numbers of male and female speakers.
The  utterances  were  recorded  under  low  noise  conditions  using  close  talking  headset  mi¬ 
crophones.  Nevertheless,  the  recordings  contain  a  considerable  amount  of  human  (coughs, 
breathing)  and  non-human  (key  clicks,  electronic  hum)  noise. 


8.3  General  System  Description 

The feature space for the system is cepstrum based. ADC data is preprocessed in the
following steps:

(1)  Detect  Speech  primarily  based  on  signal  power.  Use  this  feature  to  suppress  non- 
speech  segments. 


(2) Compute short-time FFT over 16ms windows at a frame rate of 100 frames/sec.

(3) Convert frequency scale into a log mel scale with 30 coefficients.

(4)  Compute  cepstrum  with  13  coefficients. 

(5)  Compute  delta  and  delta-delta  features  and  merge  them  with  cepstrum  and  some 
ADC  features  like  power  and  zero  crossing  rate. 

(6)  Apply  context-independent  LDA  and  shrink  the  resulting  47  dimensional  vector  to 
the  32  most-significant  coefficients. 

(7) In some experiments, we merged a 5-frame window of 32-dimensional features into
a 160-dimensional feature to provide more context information for the networks.
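Steps (5) and (7) can be illustrated with a short Python sketch. The text does not say how the delta features are computed or how utterance boundaries are handled, so the two-point differences and the repetition of the first and last frame below are assumptions, as are the function names.

```python
def add_deltas(frames):
    """Append first and second differences (delta, delta-delta) to every frame."""
    def diff(seq):
        last = len(seq) - 1
        return [[b - a for a, b in zip(seq[max(t - 1, 0)], seq[min(t + 1, last)])]
                for t in range(len(seq))]
    delta = diff(frames)
    delta2 = diff(delta)
    return [f + d + dd for f, d, dd in zip(frames, delta, delta2)]

def stack_window(frames, width=5):
    """Concatenate each frame with its neighbours; edges repeat the first/last frame."""
    half = width // 2
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-half, half + 1):
            window.extend(frames[min(max(t + offset, 0), last)])
        stacked.append(window)
    return stacked
```

With 13 cepstral coefficients, `add_deltas` yields 39 values per frame, and stacking five 32-dimensional frames gives the 160-dimensional input mentioned in step (7).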

Since the HME's require supervised training, we need to generate alignment paths for
each training utterance, which in turn provide targets for each frame. There are many
ways of computing training alignments for a connectionist system. A purely connectionist
hybrid system, however, requires iterative training, where the system of a previous
iteration itself is used to align the training data for the next iteration. There are two major
drawbacks of this kind of training. It requires many iterations and a consistent stopping
criterion, and it relies heavily on reasonable initial network parameters. Some researchers
accomplish the latter by pre-training the networks on a hand-labeled phonetic database
such as TIMIT.

We use a different training scheme. Since our recognizer integrates connectionist and
mixture-of-Gaussians based scoring, it is relatively easy to use a well-trained Gaussian
recognizer to align the training data for the hybrid system. Therefore, we compute alignment
paths for each training utterance and save them to disk. These paths are subsequently
used as targets for the NN training. We found that this training scheme worked very well,
although ultimately, we might gain performance by re-training the networks on alignments
that were generated by the (trained) hybrid system.

All experiments were carried out using a 3-state HMM left-right topology and 51
monophones. The resulting setup for the HmeStream therefore was as follows: 1 state
discriminating HME, 3 monophone HME's and a maximum of 153 context modeling
HME's for context-dependent systems.

The systems are evaluated in terms of word accuracy (WA), substitution (S), deletion
(D) and insertion (I) rates, using a set of 291 test utterances which were kept apart from
the training data. The number of training iterations performed and the size of the system
in terms of the number of acoustic modeling parameters are reported also.


8.4  CI  Systems

We trained several systems, based on different HME architectures and different HME
node classifiers, to evaluate the hybrid system. We started to experiment with
context-independent hybrid HME systems and investigated the following architectures:

• GLIM nodes: Trees of depth 2 with a branching factor of 4. Gate and expert
nodes were generalized linear models.

• Gaussian nodes: Trees of depth 1 with a branching factor of 52, which is the
number of monophones in the system. The branching factor was chosen as the
number of monophones to be able to use the fast initialization technique for MGE's
that we presented earlier.

• Growing trees: Trees with a constant branching factor of 4 and GLIM nodes,
adaptively grown with the constructive method presented in this thesis. The trees
were grown until they contained the same number of experts (16) as the other GLIM
based architecture. To speed up the tree growing phase, we used a restricted training
set of about one tenth of all training utterances. However, the grown architecture
was then retrained on the whole training set.

• MLP nodes: Trees of depth 1 with a branching factor of 4 and 2-layer MLP nodes.
Each MLP contained either 100 or 300 hidden nodes. The architecture was trained
by gradient ascent in log likelihood, assuming a multinomial probability model for
gates and experts. Therefore, the output non-linearity of all MLP's was the softmax
function.

• Single node MLP: HME's consisting of only one single expert node, containing
a 2-layer MLP with 500 hidden nodes. This architecture is comparable to ICSI's
hybrid system based on MLP's.

• Gender dependent MLP nodes: Separate MLP-HME's trained on male and
female speakers, respectively. After training, the two gender dependent HME's
were combined into a new HME, introducing an additional top-level gate. The whole
architecture was then retrained for one additional iteration. This form of initializing
an HME resembles the Meta-Pi paradigm, as introduced in [18].


Results for the above systems are summarized in the following table:


System  nodes  # params  #iter  itime  WC     Subs   Dels   Ins    WA
...     ...    ...       4      ...    ...    23.2%  10.7%  8.4%   57.7%
...     ...    530k      3      ...    ...    22.4%  9.8%   9.5%   58.3%
...     GLIM   421k      9      ...    ...    ...    ...    9.1%   57.9%
...     MLP    962k      3      ...    ...    ...    ...    8.1%   60.8%
HME-4   ...    420k      1      17h    68.5%  21.9%  9.6%   9.3%   59.2%
HME-5   MLP    1.0M      ...    30h    69.6%  20.6%  9.8%   7.9%   61.7%

In this table, #iter stands for the number of training iterations that were performed
and itime stands for the amount of time required for one iteration through the training
data (measured on a DEC Alpha workstation). WC, Subs, Dels, Ins and WA are abbreviations
for word correct rate, substitution rate, deletion rate, insertion rate and word
accuracy, respectively.
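The five rates are tied together by the usual word-level scoring identities. The definitions below are the standard ones and are stated here as an assumption, since the scoring formulas are not spelled out in this chapter; N, S, D and I denote the number of reference words, substitutions, deletions and insertions.

```latex
\[
\mathrm{WC} = \frac{N - S - D}{N}, \qquad
\mathrm{WA} = \frac{N - S - D - I}{N} = \mathrm{WC} - \frac{I}{N}
\]
```

The figures reported for the last system in the table are consistent with this reading: 100% - 20.6% - 9.8% = 69.6% (WC) and 69.6% - 7.9% = 61.7% (WA).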

We achieved the best results with systems that used MLP’s as node classifiers. However,
this is largely due to the fact that these systems had more parameters than the ones
based on GLIM’s. Larger GLIM based HME’s have the disadvantage of increased
tree traversal overhead during training and testing.


8.5  CD  Systems 

Next, we trained and tested context-dependent hybrid systems. Since the context-dependent
posteriors are modeled by independent sets of CI and CD HME’s, the context
HME’s can be trained separately. Also, the context HME’s are trained on much smaller
training sets, depending on the priors of the corresponding monophones. Therefore, the
complexity of context HME’s can be kept low, which is favourable both in terms of the
number of additional parameters and in the additional training time. For this thesis, we
trained context HME’s consisting of only one expert node, a multinomial GLIM. This
requires only a very modest increase in the number of parameters and in the training
time. From our continuous density HMM recognizer, a polyphone clustering decision tree
with 2000 context classes was available. This tree can be shrunk to any desired number
of context classes. We used trees with 500 and 1000 context classes for our experiments.
Training the context HME’s took only about 2-5 hours and required only one iteration
through the training data. After the context HME’s had been trained, they were used to
augment some of the context-independent hybrid systems presented in the last section.
The following table summarizes the results for the context-dependent hybrid HME/HMM
systems:


System     Type       CI                CD-500            CD-1000
                      WA      # param   WA      # param   WA      # param
HME-CD-1   GLIM-2-4   57.7%   421k      60.8%   501k      63.8%   581k
HME-CD-2   MLP-1-4    60.8%   962k      61.7%   1.06M     65.8%   1.14M
HME-CD-3   MLP-GD     61.7%   1.0M      N/A     1.08M     67.1%   1.16M

The numbers reported in the WA columns are word accuracies. The best hybrid
HME/HMM system achieved a word accuracy of 67.1% using 1000 context classes. Our
context-dependent continuous-density mixture of Gaussians HMM recognizer currently
achieves between 71% and 73.1%, modeling 5000 context classes with tied mixtures over
2000 distinct codebooks. This system contains over 4 million parameters, which is 4-8
times more than the neural network systems that we analyzed for this thesis.
Furthermore, decoding is about 2-5 times faster for the hybrid system, rendering it
useful for near-realtime decoding (i.e. demo situations). Fig. 8.2 gives an overview of the
performance of the various hybrid systems in terms of word accuracy.


Figure 8.2: Word accuracies for several hybrid HME/HMM systems


8.6  CD  Smoothing 

In our context-dependent hybrid HMM system, we estimate scaled acoustic model
likelihoods in the following way:

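\[
p(x \mid c_j, \omega_i, s_k) \;\propto\;
\frac{p(c_j \mid \omega_i, s_k, x)}{P(c_j \mid \omega_i, s_k)} \cdot
\frac{p(\omega_i \mid s_k, x)}{P(\omega_i \mid s_k)} \cdot
\frac{p(s_k \mid x)}{P(s_k)}
\]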

As in [30], we introduce a smoothing factor for the context-dependent posteriors
in order to compensate for the different dynamic ranges of context-independent and
context-dependent posteriors. The above likelihood estimation is therefore modified to
include a context-dependent likelihood scaling factor γ with 0.0 ≤ γ ≤ 1.0.

A smoothing factor γ = 1.0 corresponds to the original likelihood estimation, where
context-dependent and context-independent scaled likelihoods are weighted equally. As
γ goes towards zero, the contribution of the context-dependent HME’s is reduced. For
γ = 0.0 the system degenerates to a context-independent system: context-dependent
likelihood estimates are fully suppressed.
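One way to realize such a smoothing factor, consistent with the two limiting cases just described, is to apply it as an exponent to the context-dependent scaled likelihood, so that it becomes a weight in the log domain. This exponent form is an assumption made for illustration. The sketch also lumps the monophone and state terms into a single context-independent term, and the function name, argument names and all numbers are hypothetical.

```python
import math

def scaled_log_likelihood(p_ctx, prior_ctx, p_mono, prior_mono, gamma=1.0):
    """Log of a smoothed scaled likelihood for one frame.

    p_ctx / prior_ctx  : context-dependent posterior and its prior
    p_mono / prior_mono: context-independent posterior and its prior
    gamma              : smoothing factor in [0, 1] (assumed exponent form)
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    ci_term = math.log(p_mono) - math.log(prior_mono)
    cd_term = math.log(p_ctx) - math.log(prior_ctx)
    return ci_term + gamma * cd_term

# Hypothetical frame: posteriors from the networks, priors from training counts.
full = scaled_log_likelihood(0.30, 0.05, 0.60, 0.02, gamma=1.0)
ci_only = scaled_log_likelihood(0.30, 0.05, 0.60, 0.02, gamma=0.0)
smoothed = scaled_log_likelihood(0.30, 0.05, 0.60, 0.02, gamma=0.8)
```

With gamma = 0.0 only the context-independent term survives, and with gamma = 1.0 both terms carry equal weight, matching the two limits in the text.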


Figure 8.3: Smoothing context-dependent scaled likelihoods (horizontal axis: smoothing factor, 0 to 1)

The effect of this kind of smoothing can be seen in Fig. 8.3, which shows the word
accuracy for different smoothing factors applied to the HME-CD-2 system.

In this experiment, the word accuracy of the system could be improved by 1.1% with
a smoothing factor of 0.8. Instead of using just one single smoothing factor for all the
context-dependent HME’s, it might be advantageous to have a separate smoothing factor
for each of the context-dependent HME’s. In principle, this option is available in
the current implementation of the hybrid system. However, a learning algorithm for
the smoothing factors would have to be implemented, because they can no longer be
tuned by sampling the word accuracy. This might be done in future work.


8.7  Prior  Division  and  SDN 

Our implementation of the hybrid system allows the selective activation of each single
HME. This makes it possible to experiment with different setups without having to boot
new systems from scratch. For instance, a context-dependent system can easily be switched
to a context-independent one by turning off all the context networks. Furthermore, the state
discriminating network (SDN) in a multi-state topology can also be switched on and off.
To check the validity of the theoretical results experimentally, we performed several test
runs with the SDN enabled and disabled, respectively. The results were consistent with the
theory for all tests. The systems with disabled SDN were always 2-3% worse than the
ones with the SDN enabled, in terms of word accuracy.

Division of network outputs by class prior probabilities was also observed to boost
performance. However, in some cases where we trained the networks on relatively small
amounts of data, we found that prior division had the opposite effect of decreasing overall
performance. Since prior probabilities are estimated by relative frequencies in the training
set, smaller training set sizes will lead to poorer estimates of class priors. Especially when
some of the classes have very low priors, a large training set is indispensable.
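The sensitivity of prior division to rare classes can be illustrated with a minimal sketch. It assumes frame-level class labels and a dictionary of network posteriors for one frame; the helper names, the label counts and the posterior values are hypothetical.

```python
from collections import Counter

def estimate_priors(labels, classes):
    """Relative-frequency estimate of class priors from frame labels."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in classes}

def divide_by_priors(posteriors, priors):
    """Turn posteriors into scaled likelihoods p(c|x) / P(c).

    Classes whose estimated prior is zero are skipped, since the ratio
    is undefined for them.
    """
    return {c: p / priors[c] for c, p in posteriors.items() if priors[c] > 0.0}

classes = ["SIL", "N", "ZH"]
# Hypothetical frame labels; ZH is rare, so its prior rests on a single frame.
labels = ["SIL"] * 60 + ["N"] * 39 + ["ZH"] * 1
priors = estimate_priors(labels, classes)

posteriors = {"SIL": 0.70, "N": 0.25, "ZH": 0.05}
scaled = divide_by_priors(posteriors, priors)
```

In this toy case the rare class ends up with the largest scaled likelihood although it has the smallest posterior. Since its prior is estimated from a single frame, one more or one fewer labeled frame would change that score by a large factor.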


8.8  Analyzing  the  Systems 

A hybrid speech recognition system should not only be evaluated in terms of word accuracy
or word error rates. We will therefore take a closer look at some other aspects of the hybrid
recognition process.

8.8.1  Sample  Hypotheses 

Taking a closer look at some of the recognizer’s hypotheses can provide insight into the kind
of errors that are made. Also, it is interesting to compare recognition hypotheses from a
hybrid and a traditional system. Following is a list of typical false recognition hypotheses
of the traditional HMM (TRD) and the hybrid HME (HYB) system together with the
correct reference (REF):

REF: Okay that’s fine so Wednesday the third at the coffee shop
TRD: We could do it so fine so Wednesday the third at coffee shop
HYB: Okay that sounds fine so Wednesday the third at that coffee shop

REF: should we meet again sometimes
TRD: with with should we meet again some times with
HYB: should we meet again some times

REF: Well would you be free on friday the eighth
TRD: hours now would June be you free on friday the eighth
HYB: I’m then Ron would you be free on friday the eighth

REF: okay see you then
TRD: okay see you then
HYB: okay see you then is

REF: yes today is january the fourth so yeah tomorrow is that okay
TRD: yes two days January four so yeah tomorrow is that okay
HYB: I yesterday January the four so I’m yeah tomorrow is that okay

Generally, both systems commit errors in the same regions. However, there are also
parts where one of the systems detects the right words whereas the other system is
completely wrong, and vice versa. This encourages the exploration of systems where
observation likelihoods are computed as a combination of neural network and parametric
mixture methods.
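The substitution, deletion and insertion counts behind such comparisons are usually obtained from a minimum edit distance alignment of hypothesis and reference. The following sketch uses unit costs for all three error types, which is an assumption (the scoring tool actually used is not specified here), and the function names are illustrative.

```python
def align_counts(ref, hyp):
    """Count substitutions, deletions and insertions on a minimum edit
    distance alignment of two word lists (unit costs)."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (errors, subs, dels, ins) for ref[:i] against hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, k = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (e, s, d, k)              # match, no new error
            else:
                best = (e + 1, s + 1, d, k)      # substitution
            e, s, d, k = cost[i - 1][j]
            best = min(best, (e + 1, s, d + 1, k))   # deletion
            e, s, d, k = cost[i][j - 1]
            best = min(best, (e + 1, s, d, k + 1))   # insertion
            cost[i][j] = best
    _, subs, dels, ins = cost[n][m]
    return subs, dels, ins

def word_accuracy(ref, hyp):
    """Word accuracy in percent: (N - S - D - I) / N."""
    subs, dels, ins = align_counts(ref, hyp)
    return 100.0 * (len(ref) - subs - dels - ins) / len(ref)

ref = "okay see you then".split()
hyb = "okay see you then is".split()
counts = align_counts(ref, hyb)
```

For the fourth example above, the hybrid hypothesis differs from the reference by a single inserted word, so the four-word reference is scored with one insertion.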

8.8.2  Gating  Probability  Diagrams 

One of the advantages of HME’s over monolithic neural networks is the distributed way
of solving the classification task. To demonstrate how the HME’s that we trained on
ESST data behave in terms of gating and distributing responsibility among experts, we
developed a tool that plots gating probabilities (expert activations) over time
for an HME. Fig. 8.4 shows such a plot for the mid-state HME of the HME-1 system
presented earlier. The HME consists of 16 experts and 5 gates, organized in a 2-level
tree of branching factor 4. The plot was generated by computing HME activations along
a forced alignment of a recognized hypothesis. It also contains vertical lines indicating
word boundaries.


Figure 8.4: Expert activations over time for HME-2-4 (utterance: +NOISE+ I’M ALSO FREE AFTER ONE PM ON WEDNESDAY THE FOURTH SO WHAT ABOUT AROUND TWO)

The above plot reveals some interesting aspects of our hybrid HME system. The
beginning and ending parts of the above utterance contain long noise segments, which coincide
with strong activations of just two experts (number 10 and 11 from top to bottom).
Experts number 2, 13 and 16 contribute most during speech segments. There are
also some experts which are hardly ever active at all (1, 6 and 8, for instance). However,
we found that in other utterances, spoken by different speakers, some of these experts
show different behaviour and do contribute to the HME’s decision. Nevertheless, some
experts are candidates for pruning, because their contribution, accumulated over a set of test
utterances, is too low to be of any significance.
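How the activation of a leaf expert arises from the gates, and how accumulated activations could flag pruning candidates, can be sketched as follows. The sketch assumes softmax gates and a 2-level tree of branching factor 4, as described above. The pruning threshold, the function names and the gate pre-activations are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of pre-activations."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def expert_activations(top_scores, low_scores):
    """Activations of the 16 leaf experts in a 2-level tree of branching factor 4.

    top_scores: 4 pre-activations of the root gate
    low_scores: 4 lists of 4 pre-activations, one per second-level gate
    Returns a flat list h with h[4*i + j] = g_i * g_{j|i}.
    """
    g_top = softmax(top_scores)
    acts = []
    for i, scores in enumerate(low_scores):
        g_low = softmax(scores)
        acts.extend(g_top[i] * g for g in g_low)
    return acts

def prunable(frames, threshold):
    """Indices of experts whose mean activation over all frames is below threshold."""
    n = len(frames)
    means = [sum(f[k] for f in frames) / n for k in range(len(frames[0]))]
    return [k for k, mean in enumerate(means) if mean < threshold]

# Two hypothetical frames of gate pre-activations.
frame_a = expert_activations([2.0, 0.0, 0.0, -1.0], [[0.0, 1.0, 0.0, -3.0]] * 4)
frame_b = expert_activations([0.0, 3.0, 0.0, -1.0], [[1.0, 0.0, 0.0, -3.0]] * 4)
candidates = prunable([frame_a, frame_b], threshold=0.003)
```

The activations of one frame sum to one, so plotting them per frame gives the kind of diagram shown in Fig. 8.4. In practice the mean would be taken over a whole test set rather than two frames.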

8.8.3  Phoneme  Recognition 

To analyze the frame accuracy of the hybrid recognizer, we computed monophone
classification error rates and monophone confusion matrices. Since the confusion matrix for
a system with 52 monophones is rather big, we decided to present a sorted list of top-5
confusions for each monophone instead. Appendix B contains such a confusion table. In
the first column, it lists all monophones with their counts as measured on a list of 100
utterances. The remaining columns contain the top-5 confusion candidates, including the
actual monophone itself, together with the confusion percentage.
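A table of this form can be derived from frame-level decisions with a few lines of code. The sketch below assumes a list of (reference phone, recognized phone) pairs, one per frame; the function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def top_confusions(pairs, n=5):
    """Build a top-n confusion list per reference phone.

    pairs: iterable of (reference_phone, recognized_phone), one per frame.
    Returns {phone: (count, [(candidate, percentage), ...])} with the
    candidates sorted by decreasing percentage.
    """
    table = defaultdict(Counter)
    for ref, hyp in pairs:
        table[ref][hyp] += 1
    result = {}
    for ref, row in table.items():
        total = sum(row.values())
        ranked = [(hyp, 100.0 * c / total) for hyp, c in row.most_common(n)]
        result[ref] = (total, ranked)
    return result

# Hypothetical frame-level decisions for two phones.
pairs = ([("NG", "NG")] * 5 + [("NG", "N")] * 4 + [("NG", "M")] * 1
         + [("N", "N")] * 9 + [("N", "M")] * 1)
confusions = top_confusions(pairs)
```

Each entry pairs the phone count with its ranked candidates, which mirrors the layout of one row in the confusion table.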

Most confusions are consistent with what we would expect, but there are also some
confusions which appear to be less obvious. The following list contains some observations
regarding the confusion table:

•  The phone priors are highly non-uniform; some phones are very rare (for
instance OY and ZH).

•  The noise modeling phones (indicated with a leading +) are mostly confused among
themselves. Two noise phones appear to have extremely low prior probabilities
(+LA and +TH).

•  The vowels are mostly confused with other vowels.

•  The phone NG is often confused with the phone N.

•  The phone R is often confused with AXR.

•  The phones M and N are both recognized with about 60% correct rate but the phone
M is much more often confused with an N than vice versa.

•  The silence phone SIL is recognized with the highest accuracy (96.5%).

•  The average monophone classification error rate was observed to be between 35%
and 42% for the different systems.

Chapter 9
Conclusions


9.1  Summary 

We developed a modular neural network based system for estimating (scaled) emission
probabilities in an HMM speech recognizer. It is based on generalized hierarchical mixtures
of experts (HME), allowing the integration of arbitrary neural network models into tree
structured classifiers. We contributed some original work to both the field of HME’s in
general and the field of hybrid systems:

•  We presented a constructive algorithm for HME’s based on likelihood partitioning
among experts. It is considerably less expensive than standard decision tree growing
algorithms, which require the evaluation of potential splits for all leaves.

•  We investigated an alternative parameterization for both gates and experts: a
mixture of Gaussian Experts (MGE). In this architecture, every node consists of a
Gaussian classifier instead of the usual generalized linear model (GLIM). We showed
that the MGE offers a variety of initialization techniques which allow it to be trained even
faster than an HME.

•  We developed a connectionist approach to acoustic context modeling, based on factoring
context dependent acoustic posterior probabilities. Polyphonic acoustic contexts are
clustered by decision trees, which we adopt from a mixture of Gaussians based HMM
recognizer. We showed that such explicit modeling of context improves the hybrid
recognizer’s performance significantly.

The hybrid HMM system presented in this thesis offers many advantages over traditional
mixture of Gaussians based systems. It contains considerably fewer parameters and
allows faster decoding, especially when pruning is enabled. Furthermore, training time
requirements have been reduced compared to other hybrid systems, which are based on
monolithic neural networks. However, further optimizations are necessary to fully exploit
the potential of this technology.


9.2  Further Work

The presented system can be enhanced in various ways. Some of the ideas that came
up during the evaluation of the current system are summarized here. We believe that
the presented system still has a lot of potential for improvement. For instance, the
various learning and testing parameters (especially for decoding) are most probably not
yet optimal. Further work might concentrate on the following issues:

•  Mixture  of  likelihood  estimators 

The idea of multiple experts whose decisions are combined by a gate can also
be applied at higher levels in a speech recognizer. A hybrid system relies on
discriminatively trained neural networks for (scaled) likelihood estimation whereas a
traditional HMM system is based on parametric mixture densities. A system should
benefit from the combination of both techniques by a gating or mediator model on
top of the two (or possibly more) acoustic experts. In this case, the objective is to
maximize the combined estimates of the acoustic likelihood. However, gain factors
need to be applied to the different acoustic experts’ estimates, in order to account
for the different scales.

•  Unsupervised  ML  adaptation 

Unsupervised speaker adaptation has proven useful in traditional HMM speech
recognizers. A (usually linear) transformation of the parameter space is iteratively
updated by maximum likelihood when several utterances of a particular speaker
occur. The same principle can be applied to a hybrid system. Additional front-end
networks, which compute a linear transformation of the feature space, can be used to
account for speaker variations. Training labels for the front-end linear networks can
be generated by back-propagating errors resulting from a Viterbi alignment of the
decoder hypothesis. Note that this kind of speaker adaptation can also be interpreted
as a speaker adaptive LDA.

•  Improving  convergence  speed 

The GEM and gradient ascent algorithms which we presented for the HME
architecture are open to many additional optimization techniques to improve their
convergence speed. We already employed methods such as momentum terms and
on-line stochastic gradients. Especially when MLP’s are used as gates and experts,
learning parameter optimization is crucial to reduce the number of required training
iterations. Although the presented system can be trained in a few days on standard
workstations, a further decrease in training time is desirable.


•  Incorporating  additional  knowledge  sources 

The HME architecture allows in principle the use of different feature spaces for
gates and experts. Why not supply the gates with additional features such as an
estimate of the speaking rate, gender or dialect region? Together with pre-trained
experts, the classification task may become easier and the whole architecture may
be trainable much faster.

•  Speaker/Utterance  clustering 

Although it is a well known fact that acoustic features are highly speaker dependent,
most HMM recognizers make use of a single set of parameters for all speakers or, at
most, distinguish between male and female speakers. In the case of speech databases
with a high degree of speaker variability, it might be more effective to cluster similar
speakers into groups which are then used to train a set of neural networks. These
pre-trained neural networks can then easily be integrated and trained further as
HME’s.

•  Learning  CD  smoothing  factors 

We introduced a smoothing factor between context independent and context dependent
network outputs which was shown to improve performance over a non-smoothed
system. We used a single smoothing factor for all the context networks in our
system. Our system also allows a separate smoothing factor for each one of the
context networks. However, it remains to derive a learning algorithm for these
smoothing factors (maximum likelihood). Separate smoothing factors will provide
a better information scaling between the CI and CD networks.

•  Dynamic  score  scaling  factor 

We discovered large differences in the number of insertions and deletions among the
decoded test set utterances. In some cases, the insertion rate is much higher than
the deletion rate, indicating that the word insertion penalty is too low. In other
cases, however, the opposite behaviour can be observed (for the same language model
parameters). It seems that the variation in the acoustic scores leads to different
relative weights of the language model parameters. An adaptive score scaling factor
might help to overcome this effect.

•  Confidence  measure  based  on  posteriors 

Since the acoustic models in a hybrid system are trained discriminatively, it might be
useful to derive a phone or word confidence measure based on the networks’ estimates
of frame posteriors. Furthermore, a simple measure of the frame confidence (such as
the difference in score between the best and the second best acoustic model) might
be useful to dynamically adjust the search beam during decoding.
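The last idea in the list, a frame confidence taken as the margin between the best and the second best score, can be sketched in a few lines. The beam adaptation rule shown here is purely hypothetical and only illustrates how such a margin could be used; the function names, the widening factor and the posterior values are not taken from the system.

```python
def frame_confidence(posteriors):
    """Margin between the best and the second best class posterior of one frame."""
    best, second = sorted(posteriors, reverse=True)[:2]
    return best - second

def adapt_beam(base_beam, confidence, widen=2.0):
    """Hypothetical rule: keep the base beam for fully confident frames and
    widen it linearly (up to base_beam * widen) as the margin shrinks to zero."""
    return base_beam * (widen - (widen - 1.0) * confidence)

# Hypothetical posterior vectors for a confident and an ambiguous frame.
sure = frame_confidence([0.90, 0.05, 0.03, 0.02])
unsure = frame_confidence([0.40, 0.38, 0.12, 0.10])
```

With posteriors as input the margin lies between 0 and 1, so the ambiguous frame receives a wider beam than the confident one.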


Appendix  A 

Question  Set  for  Decision  Trees 


Question-Name     Set of Phonemes covered

NOISES            +BR +HU +NH +SM +TH +LA
HUMAN-NOISES      +BR +HU +SM +TH +LA
LAUGHTER          +LA
UHHUH             +F
SILENCES          SIL
CONSONANT         P B F V TH DH T D S Z SH ZH CH JH K G HH M N NG R
                  Y W L ER DX AXR
CONSONANTAL       P B F V TH DH T D S Z SH ZH CH JH K G HH M N NG DX
OBSTRUENT         P B F V TH DH T D S Z SH ZH CH JH K G
SONORANT          M N NG R Y W L ER AXR DX
SYLLABIC          AY OY EY IY AW OW EH IH AO AE AA AH UW UH IX AX
                  ER AXR
VOWEL             AY OY EY IY AW OW EH IH AO AE AA AH UW UH IX AX
DIPHTHONG         AY OY EY AW OW
CARDVOWEL         IY IH EH AE AA AH AO UH UW IX AX
VOICED            B D G JH V DH Z ZH M N NG W R Y L ER AY OY EY IY
                  AW OW EH IH AO AE AA AH UW UH DX AXR IX AX
UNVOICED          P F TH T S SH CH K
CONTINUANT        F TH S SH V DH Z ZH W R Y L ER
DEL-REL           CH JH
LATERAL           L
ANTERIOR          P T B D F TH S SH V DH Z ZH M N W Y L DX
CORONAL           T D CH JH TH S SH DH Z ZH N L R DX
APICAL            T D N DX
HIGH-CONS         K G NG W Y
BACK-CONS         K G NG W
LABIALIZED        R W ER AXR
STRIDENT          CH JH F S SH V Z ZH
SIBILANT          S SH Z ZH CH JH
BILABIAL          P B M W
LABIODENTAL       F V
LABIAL            P B M W F V
INTERDENTAL       TH DH
ALVEOLAR-RIDGE    T D N S Z L DX
ALVEOPALATAL      SH ZH CH JH
ALVEOLAR          T D N S Z L SH ZH CH JH DX
RETROFLEX         R ER AXR
PALATAL           Y
VELAR             K G NG W
GLOTTAL           HH
ASPIRATED         HH
STOP              P B T D K G M N NG
PLOSIVE           P B T D K G
FLAP              DX
NASAL             M N NG
FRICATIVE         F V TH DH S Z SH ZH HH
AFFRICATE         CH JH
APPROXIMANT       R L Y W
LAB-PL            P B
ALV-PL            T D
VEL-PL            K G
VLS-PL            P T K
VCD-PL            B D G
LAB-FR            F V
DNT-FR            TH DH
ALV-FR            SH ZH
VLS-FR            F TH SH
VCD-FR            V DH ZH
ROUND             AO OW UH UW OY AW OW
HIGH-VOW          IY IH UH UW IX
MID-VOW           EH AH AX
LOW-VOW           AA AE AO
FRONT-VOW         IY IH EH AE
CENTRAL-VOW       AH AX IX
BACK-VOW          AA AO UH UW
TENSE-VOW         IY UW AE
LAX-VOW           IH AA EH AH UH
ROUND-VOW         AO UH UW
REDUCED-VOW       IX AX
REDUCED-CON       AXR
REDUCED           IX AX AXR
LH-DIP            AY AW
MH-DIP            OY OW EY
BF-DIP            AY OY AW OW
Y-DIP             AY OY EY
W-DIP             AW OW
ROUND-DIP         OY AW OW
LIQUID-GLIDE      L R W Y
W-GLIDE           UW AW OW W
LIQUID            L R
LW                L W
Y-GLIDE           IY AY EY OY Y
LQGL-BACK         L R W


Appendix B

Monophone  Confusion  Table 


Correct Phone

+BR,14988   +BR (53.776%)   +SM (14.905%)   SIL (7.986%)    +NH (7.686%)
+F,9129     +F (52.788%)    +BR (12.794%)   AY (4.962%)     +HU (3.626%)
+HU,?       ?               SIL (11.754%)   +BR (11.345%)   +F (9.240%)




AE,424S 


AH, 4396 


AO, 4057 


AW  ,30  58 


.AX,4390 


+HU  (S.129%) 


+HU  (0.647%) 


21.655%) 


H  (3^03%) 


59.890%) 


68.657%) 


43.281%) 


EH, 3703 

1  EH  (63.003%) 

EY,5590 

71.521%) 

F  (84.390%) 


G  (46.495%) 


IH  (9.207%) 


AXH  (26.419%) 


6.967%) 


8.582%) 


12.701%) 


r  (8.300%) 


3.951%) 




+BH  (1.712%) 


AY  (5.138%) 


6.560%) 


IH  (9.442%)  I  EY  (9.277%) 


K  (7.510%) 


DH  (3.717%) 


TH  (3.495%) 


4.076%) 


+SM  (1.202%) 


N  (3.400%) 


2.750%) 


46.463%) 


15.756%) 


r  (6.304%) 


N, 13024 

IHEX 

63.506%) 

HI 

6.741%) 

NG,1751 

N  (29.640%) 

OW  (38.623%) 

HX 

9.557%) 

OW  (6.138%)  I  A 

+F  (3.515%) 
.413%)  1  +F  (3.209%) 


M  (5.768%) 


AX  (4.252%) 


iX  (1.756%) 


3.698%) 


FBH  (2.598%) 




W  (1.481%) 


2.434%) 




AO  (10.000%) 


r (7.043%)


Correct Phone   1               2               3               4               5

S,?70?          S (83.032%)     Z (6.431%)      F (2.664%)
SH,?            SH (?3.266%)    B (13.903%)     CH (8.878%)     ? (7.873%)
SIL,?           ?               +BR (1.039%)    +SM (0.631%)    +NH (0.281%)    K (0.143%)
T,?             T (61.266%)     K (7.779%)      B (3.210%)      B (2.926%)      +SM (2.449%)
TH,4353         ?               F (10.700%)     +BR (4.663%)    SIL (4.319%)    +SM (4.273%)
UH,1147         ?               IH (7.934%)     AX (6.103%)
UW,4354         UW (?2.026%)    IH (4.410%)     ?
V,25?5          V (42.100%)     F (16.720%)     M (4.400%)
W,5?49          W (69.340%)     L (0.779%)      AO (4.360%)
Y,17?5          Y (47.008%)     IY (28.262%)    UW (3.300%)     IH (2.963%)     ?
Z,2994          Z (?9.987%)     B (29.120%)     ?
ZH,6            IY (?0.000%)    ?               +BR (0.000%)